We first study the properties of solutions of quadratic programs
with linear equality constraints whose parameters are estimated from 
data in the high-dimensional setting where p, the number of variables 
in the problem, is of the same order of magnitude as n, the number of 
observations used to estimate the parameters. The Markowitz prob- 
lem in Finance is a subcase of our study. Assuming normality and 
independence of the observations we relate the efficient frontier com- 
puted empirically to the "true" efficient frontier. Our computations 
show that there is a separation of the errors induced by estimating 
the mean of the observations and estimating the covariance matrix. 
In particular, the price paid for estimating the covariance matrix is an
underestimation of the variance by a factor roughly equal to 1 - p/n.
Therefore the risk of the optimal population solution is underesti- 
mated when we estimate it by solving a similar quadratic program 
with estimated parameters. 

We also characterize the statistical behavior of linear functionals of 
the empirical optimal vector and show that they are biased estimators 
of the corresponding population quantities. 

We investigate the robustness of our Gaussian results by extend- 
ing the study to certain elliptical models and models where our n 
observations are correlated (in "time"). We show a lack of robustness 
of the Gaussian results, but are still able to get results concerning 
first order properties of the quantities of interest, even in the case of 
relatively heavy-tailed data (we require two moments). Risk under- 
estimation is still present in the elliptical case and more pronounced 
than in the Gaussian case.

We discuss properties of the nonparametric and parametric bootstrap
in this context. We show several results, including the interesting
fact that standard applications of the bootstrap generally yield
inconsistent estimates of bias.

We propose some strategies to correct these problems and practically
validate them in some simulations. Throughout this paper, we
will assume that p, n and n - p tend to infinity, and p < n.

Finally, we extend our study to the case of problems with more 
general linear constraints, including, in particular, inequality con- 
straints. 

1. Introduction. Many statistical estimation problems are now formu- 
lated, implicitly or explicitly, as solutions of certain optimization problems. 
Naturally, the parameters of these problems tend to be estimated from data 
and it is therefore important that we understand the relationship between 
the solutions of two types of optimization problems: those which use the 
population parameters and those which use the estimated parameters. This 
question is particularly relevant in high-dimensional inference where one sus- 
pects that the differences between the two solutions might be considerable. 
The aim of this paper is to contribute to this understanding by focusing on 
quadratic programs with linear constraints. An important example of such 
a program where our questions are very natural is the celebrated Markowitz 
optimization problem in Finance which will serve as a supporting example 
throughout the paper. 

The Markowitz problem [Markowitz (1952)] is a classic portfolio opti- 
mization problem in Finance, where investors choose to invest according to 
the following framework: one picks assets in such a way that the portfolio 
guarantees a certain level of expected returns but minimizes the "risk"
associated with them. In the standard framework, this risk is measured by the
variance of the portfolio.

Markowitz's paper was highly influential and much work has followed. It 
is now part of the standard textbook literature on these issues [Ruppert 
(2006), Campbell, Lo and MacKinlay (1996)]. Let us recall the setup of the 
Markowitz problem. 

• We have the opportunity to invest in $p$ assets, $A_1, \ldots, A_p$.

• In the ideal situation, the mean returns are known and represented by a
$p$-dimensional vector, $\mu$.

• Also, the covariance between the returns is known; we denote it by $\Sigma$.

• We want to create a portfolio with guaranteed mean return $\mu_P$ and
minimize its risk, as measured by variance.

• The question is: how should the assets be weighted in the portfolio? What are
the weights $w$?

We note that $\Sigma$ is positive semi-definite and hence is in particular
symmetric. In the ideal (or population) solution, the covariance and the mean are
known. The mathematical formulation is then the following simple quadratic
program. We wish to find the weights $w$ that solve the following problem:

$$
\begin{cases}
\min \frac{1}{2} w'\Sigma w,\\
w'\mu = \mu_P,\\
w'e = 1.
\end{cases}
$$

Here $e$ is a $p$-dimensional vector with 1 in every entry. If $\Sigma$ is invertible, the
solution is known explicitly (see Section 2). If we call $w_{\mathrm{optimal}}$ the solution of
this problem, the curve $w_{\mathrm{optimal}}'\Sigma w_{\mathrm{optimal}}$, seen as a function of $\mu_P$, is called
the efficient frontier.

Of course, in practice, we do not know $\mu$ and $\Sigma$ and we need to estimate
them. An interesting question is therefore to know what happens in the 
Markowitz problem when we replace population quantities by corresponding 
estimators. 

Naturally, we can ask a similar question for general quadratic programs 
with linear constraints [see below or Boyd and Vandenberghe (2004) for a 
definition], the Markowitz problem being a particular instance of such a 
problem. This paper provides an answer to these questions under certain 
distributional assumptions on the data. Hence our paper is really about 
the impact of estimation error on certain high-dimensional M-estimation 
problems. 

It has been observed by many that there are problems in practice when 
replacing population quantities by standard estimators [see Lai and Xing 
(2008), Section 3.5], and alternatives have been proposed. A famous one 
is the Black-Litterman model [Black and Litterman (1990), Meucci (2005) 
and, e.g., Meucci (2008)]. Adjustments to the standard estimators have also 
been proposed: Ledoit and Wolf (2004), partly motivated by portfolio op- 
timization problems, proposed to "shrink" the sample covariance matrix 
toward another positive definite matrix (often the identity matrix properly 
scaled), while Michaud (1998) proposed to use the bootstrap and to average 
bootstrap weights to find better-behaved weights for the portfolio. As noted 
in Lai and Xing (2008), there is a dearth of theoretical studies regarding, in 
particular, the behavior of bootstrap estimators. 

An aspect of the problem that is of particular interest to us is the study of 
large-dimensional portfolios (or quadratic programs with linear constraints). 
To make matters clear, we focus on a portfolio with $p = 100$ assets. If we
use a year of daily data to estimate $\Sigma$, the covariance between the daily
returns of the assets, we have $n \approx 250$ observations at our disposal. In modern
statistical parlance, we are therefore in a "large n, large p" setting, and we
know from random matrix theory that $\hat\Sigma$, the sample covariance matrix, is a
poor estimator of $\Sigma$, especially when it comes to spectral properties of $\Sigma$.
There is now a developing statistical literature on properties of sample
covariance matrices when n and p are both large, and it is now understood
that, though $\hat\Sigma$ is unbiased for $\Sigma$, the eigenvalues and eigenvectors of $\hat\Sigma$
behave very differently from those of $\Sigma$. We refer the interested reader to
Johnstone (2001), El Karoui (2007, 2008, 2009a), Bickel and Levina (2008a),
Rothman et al. (2008) for a partial introduction to these problems. We wish
with this study to make clear that the "large n, large p" character of the
problem has an important impact on the empirical solution of the problem.
By contrast, standard but thorough discussions of these problems [Meucci 
(2005)] give only a cursory treatment of dimensionality issues (e.g., one page 
out of a whole book). 

Another interesting aspect of this problem is that the high-dimensional 
setting does not allow, by contrast to the classical "small p, large n" setting, 
a perturbative approach to go through. In the "small p, large n" setting, the 
paper Jobson and Korkie (1980) is concerned, in the Gaussian case, with 
issues similar to the ones we will be investigating. 

The "large n, large p" setting is the one with which random matrix theory
is concerned, and the high-dimensional Markowitz problem has therefore
been of interest to random matrix theorists for some time now. We note in
particular the paper Laloux et al. (2000), where a random matrix-inspired
(shrinkage) approach to improved estimation of the sample covariance
matrix is proposed in the context of the Markowitz problem.

Let us now remind the reader of some basic facts of random matrix theory 
that suggest that serious problems may arise if one solves naively the high- 
dimensional Markowitz problem or other quadratic programs with linear 
equality constraints. A key result in random matrix theory is the Marcenko- 
Pastur equation [Marcenko and Pastur (1967)] which characterizes the lim- 
iting distribution of the eigenvalues of the sample covariance matrix and 
relates it to the spectral distribution of the population covariance matrix. 
We give only in this introduction its simplest form and refer the reader to 
Marcenko and Pastur (1967), Wachter (1978), Silverstein (1995), Bai (1999) 
and, for example, El Karoui (2009a) for a more thorough introduction and 
very recent developments, as well as potential geometric and statistical lim- 
itations of the models usually considered in random matrix theory. 

In the simplest setting, we consider data $\{X_i\}_{i=1}^n$, which are $p$-dimensional.
In a financial context, these vectors would be vectors of (log)-returns of
assets, the portfolio consisting of $p$ assets. To simplify the exposition, let us
assume that the $X_i$'s are i.i.d. with distribution $\mathcal{N}(0, \mathrm{Id}_p)$. We call $X$ the
$n \times p$ matrix whose $i$th row is the vector $X_i$. Let us consider the sample
covariance matrix

$$
\hat\Sigma = \frac{1}{n-1}(X - \bar X)'(X - \bar X),
$$

where $\bar X$ is a matrix whose rows are all equal to the column mean of $X$.
Now let us call $F_p$ the spectral distribution of $\hat\Sigma$, that is, the probability
distribution that puts mass $1/p$ at each of the $p$ eigenvalues of $\hat\Sigma$. A graphical
representation of this probability distribution is naturally the histogram
of eigenvalues of $\hat\Sigma$. A consequence of the main result of the very profound
paper Marcenko and Pastur (1967) is that $F_p$, though a random measure, is
asymptotically nonrandom, and its limit, in the sense of weak convergence
of distributions, $F$, has a density (when $p < n$) that can be computed. $F$
depends on $\rho = \lim_{n\to\infty} p/n$ in the following manner: if $p < n$, the density of
$F$ is

$$
f_\rho(x) = \frac{1}{2\pi\rho}\,\frac{\sqrt{(y_+ - x)(x - y_-)}}{x}\,1_{y_- \le x \le y_+},
$$

where $y_+ = (1 + \sqrt{\rho})^2$ and $y_- = (1 - \sqrt{\rho})^2$. Figure 1 presents a graphical
illustration of this result.

Fig. 1. Illustration of the Marcenko-Pastur law, n = 500, p = 200. The red curve is the
density of the Marcenko-Pastur law for $\rho = 2/5$. The simulation was done with i.i.d. Gaussian
data. The histogram is the histogram of eigenvalues of X'X/n.
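As an illustration of the statement above, the following minimal sketch simulates i.i.d. Gaussian data with identity covariance and compares the extreme eigenvalues of the sample covariance matrix with the edges of the Marcenko-Pastur support. The function names, the seed and the grid size are our own choices for this illustration, and the values n = 500, p = 200 simply mirror those quoted for Figure 1.

```python
import numpy as np

def marchenko_pastur_density(x, rho):
    """Marchenko-Pastur density for identity covariance and ratio rho = p/n < 1."""
    y_minus = (1.0 - np.sqrt(rho)) ** 2
    y_plus = (1.0 + np.sqrt(rho)) ** 2
    x = np.atleast_1d(np.asarray(x, dtype=float))
    density = np.zeros_like(x)
    inside = (x > y_minus) & (x < y_plus)   # the density vanishes outside the support
    xi = x[inside]
    density[inside] = np.sqrt((y_plus - xi) * (xi - y_minus)) / (2.0 * np.pi * rho * xi)
    return density

def sample_covariance_eigenvalues(n, p, rng):
    """Eigenvalues (ascending) of the sample covariance of n i.i.d. N(0, Id_p) rows."""
    X = rng.standard_normal((n, p))
    X_centered = X - X.mean(axis=0)
    sigma_hat = X_centered.T @ X_centered / (n - 1)
    return np.linalg.eigvalsh(sigma_hat)

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
n, p = 500, 200
rho = p / n
y_minus, y_plus = (1 - np.sqrt(rho)) ** 2, (1 + np.sqrt(rho)) ** 2
eigenvalues = sample_covariance_eigenvalues(n, p, rng)

# Riemann sum of the density over its support; it should be close to 1.
grid = np.linspace(y_minus, y_plus, 20001)
total_mass = np.sum(marchenko_pastur_density(grid, rho)) * (grid[1] - grid[0])

print("smallest / largest sample eigenvalue:", eigenvalues[0], eigenvalues[-1])
print("Marchenko-Pastur support edges:      ", y_minus, y_plus)
print("numerical mass of the density:       ", total_mass)
```

Although every population eigenvalue equals 1 here, the sample eigenvalues spread over roughly the whole interval between the two edges.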

What is striking about this result is that it implies that the largest
eigenvalue of $\Sigma$, $\lambda_1$, will be overestimated by $l_1$, the largest eigenvalue of $\hat\Sigma$. Also,
the smallest eigenvalue of $\Sigma$, $\lambda_p$, will be underestimated by the smallest
eigenvalue of $\hat\Sigma$, $l_p$. As a matter of fact, in the model described above, $\Sigma$ has
all its eigenvalues equal to 1, so $\lambda_1(\Sigma) = \lambda_p(\Sigma) = 1$, while $l_1$ will asymptotically
be larger than or equal to $(1 + \sqrt{\rho})^2$ and $l_p$ smaller than or equal to $(1 - \sqrt{\rho})^2$ (in
the Gaussian case and several others, $l_1$ and $l_p$ converge to those limits). We
note that the result of Marcenko and Pastur (1967) is not limited to the case
where $\Sigma$ is the identity, as presented here, but holds for general covariance $\Sigma$
($F_p$ has of course a different limit then).

Perhaps more concretely, let us consider a projection of the data along a
vector $v$, with $\|v\|_2 = 1$, where $\|v\|_2$ is the Euclidean norm of $v$. Here it is clear
that, if $X \sim \mathcal{N}(0, \mathrm{Id}_p)$, $\mathrm{var}(v'X) = 1$ for all $v$, since $v'X \sim \mathcal{N}(0, 1)$. However,
if we do not know $\Sigma$ and estimate it by $\hat\Sigma$, a naive (and wrong) reasoning
suggests that we can find directions of lower variance than 1, namely those
corresponding to eigenvectors of $\hat\Sigma$ associated with eigenvalues that are less
than 1. In particular, if $v_p$ is the eigenvector associated with $l_p$, the smallest
eigenvalue of $\hat\Sigma$, by naively estimating, for $X$ independent of $\{X_i\}_{i=1}^n$, the
variance in the direction of $v_p$, $\mathrm{var}(v_p'X)$, by the empirical version $v_p'\hat\Sigma v_p$,
one would commit a severe mistake: the variance in any direction is 1, but
it would be estimated by something roughly equal to $(1 - \sqrt{p/n})^2$ in the
direction of $v_p$.
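The mistake described in this paragraph can be reproduced numerically. The sketch below is only an illustration under our own choices (seed, n = 500, p = 200, and an independent test sample of 20000 draws): it compares the plug-in variance along the eigenvector of the smallest sample eigenvalue with the variance actually observed on fresh data.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed
n, p = 500, 200

# Training sample: i.i.d. N(0, Id_p), so the true variance is 1 in every direction.
X = rng.standard_normal((n, p))
X_centered = X - X.mean(axis=0)
sigma_hat = X_centered.T @ X_centered / (n - 1)

# eigh returns eigenvalues in ascending order; column 0 is the direction
# that looks least risky according to the sample covariance matrix.
eigenvalues, eigenvectors = np.linalg.eigh(sigma_hat)
v_p = eigenvectors[:, 0]

plug_in_variance = v_p @ sigma_hat @ v_p        # equals the smallest sample eigenvalue
population_variance = v_p @ v_p                 # v' Id_p v = ||v||^2 = 1
X_new = rng.standard_normal((20000, p))         # fresh data, independent of v_p
out_of_sample_variance = np.var(X_new @ v_p)

print("plug-in variance:      ", plug_in_variance)
print("(1 - sqrt(p/n))^2:     ", (1 - np.sqrt(p / n)) ** 2)
print("out-of-sample variance:", out_of_sample_variance)
```

The plug-in figure sits near the lower Marcenko-Pastur edge, while the variance measured on independent data stays close to 1.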

In a portfolio optimization context, this suggests that by using standard 
estimators, such as the sample covariance matrix, when solving the high- 
dimensional Markowitz problem, one might underestimate the variance of 
certain portfolios (or "optimal" vectors of weights). As a matter of fact, in 
the previous toy example, thinking (wrongly) that there is low variance in 
the direction v p , one might (numerically) "load" this direction more than 
warranted, given that the true variance is the same in all directions. 

This simple argument suggests that severe problems might arise in the 
high-dimensional Markowitz problem and other quadratic programs with 
linear constraints, and in particular, risk might be underestimated. While 
this heuristic argument is probably clear to specialists of random matrix 
theory, the problem had not been investigated at a mathematical level of 
rigor in that literature before this paper was submitted [the paper Bai, Liu 
and Wong (2009) has appeared while this paper was being refereed. It is 
concerned with different models than the ones we will be investigating and 
our results do not overlap] . It has received some attention at a physical level 
of rigor [see, e.g., Pafka and Kondor (2003), where the authors treat only 
the Gaussian case, and do not investigate the effect of the mean, which as 
we show below creates problems of its own]. In this paper, we propose a 
theoretical analysis of the problem in a Gaussian and elliptical framework 
for general quadratic programs with linear constraints, one of them involving
the parameter $\mu$. Our results and contributions are severalfold. We relate
the empirical efficient frontier to the theoretical efficient frontier that is
key to the Markowitz theory, in a variety of theoretical settings. We show
that the empirical frontier generally yields an underestimation of the risk
of the portfolio and that Gaussian analysis gives an over-optimistic view of
this problem. We show that the expected returns of the naive "optimal"
portfolio are poorly estimated by $\mu_P$. We argue that the bootstrap will not
solve the problems we are pointing out here. Besides new formulas, we also
provide robust estimators of the various quantities we are interested in.

The paper is divided into four main parts and a conclusion. In Section 2, 
to make the paper self-contained, we discuss the solution of quadratic prob- 
lems with linear equality constraints — a focus of this paper. In Section 3, we 
study the impact of parameter estimation on the solution of these problems 
when the observed data is i.i.d. Gaussian and obtain some exact distribu- 
tional results for fixed p and n. In Section 4, we obtain results in the case 
where the data is elliptically distributed. This allows us also to understand 
the impact of correlation between observations in the Gaussian case and 
to get information about the behavior of the nonparametric bootstrap. In 
Section 5, we apply the results of Section 4 to the quadratic programs at 
hand and compare the elliptical and the Gaussian cases. We show, among
other things, that the Gaussian results are not robust in the class of elliptical
distributions. In particular, two models may yield the same $\mu$ and $\Sigma$ but can
have very different empirical behavior. In Section 5, we also propose various
schemes to correct the problems we highlight (see pages 63, 64 and 65
for pictures) and study more general problems with linear constraints (see 
Section 5.6). The conclusion summarizes our findings and the Appendix con- 
tains various facts and proofs that did not naturally flow in the main text 
or were better highlighted by being stated separately. 

Several times in the paper $\Sigma^{-1}$ and $\hat\Sigma^{-1}$ will appear. Unless otherwise
noted, when taking the inverse of a population matrix, we implicitly assume 
that it exists. The question of existence of inverse of sample covariance 
matrices is well understood in the statistics literature. Because our models 
will have a component with a continuous distribution, there are essentially no 
existence problems (unless we explicitly mention and treat them) as proofs 
similar to standard ones found in textbooks [e.g., Anderson (2003)] would 
show. Hence, we do not belabor this point any further in the rest of the 
paper as our focus is on things other than rather well-understood technical 
details, and the paper is already a bit long. 

Finally, let us mention that while the Finance motivation for our study is 
important to us, we treat the problem in this paper as a high-dimensional 
M-estimation question (which we think has practical relevance). We will
not introduce particular modeling assumptions which might be relevant
for practitioners of Finance but might make the paper less relevant in other 
fields. A companion paper [El Karoui (2009b)] deals with more "financial" 
issues and the important question of the realized risk of portfolios that are 
"plug-in" solutions of the Markowitz problem. 

2. Quadratic programs with linear equality constraints. We discuss here
the properties of the solution of quadratic programs with linear equality
constraints as they lay the foundations for our analysis of similar problems
involving estimated parameters (and of problems with inequality constraints).
We included this section for the convenience of the reader to make the paper
as self-contained as possible.

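The problem we want to solve is the following:

$$
\text{(QP-eqc)}\qquad
\begin{cases}
\min_{w \in \mathbb{R}^p} \frac{1}{2} w'\Sigma w,\\
w'v_i = u_i, \quad 1 \le i \le k.
\end{cases}
$$

Here $\Sigma$ is a positive definite matrix of size $p \times p$, $v_i \in \mathbb{R}^p$ and $u_i \in \mathbb{R}$. We
have the following theorem: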

Theorem 2.1. Let us call $V$ the $p \times k$ matrix whose $i$th column is $v_i$,
$U$ the $k$-dimensional vector whose $i$th entry is $u_i$ and $M$ the $k \times k$ matrix

$$
M = V'\Sigma^{-1}V.
$$

We assume that the $v_i$'s are such that $M$ is invertible. The solution of the
quadratic program with linear equality constraints (QP-eqc) is achieved for

$$
w_{\mathrm{optimal}} = \Sigma^{-1} V M^{-1} U,
$$

and we have

$$
w_{\mathrm{optimal}}'\Sigma w_{\mathrm{optimal}} = U'M^{-1}U.
$$

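Proof. Let us call $\lambda$ a $k$-dimensional vector of Lagrange multipliers.
The Lagrangian function is, in matrix notation,

$$
L(w, \lambda) = \frac{1}{2} w'\Sigma w - \lambda'(V'w - U).
$$

This is clearly a (strictly) convex function in $w$, since $\Sigma$ is positive definite
by assumption. We have

$$
\frac{\partial L}{\partial w} = \Sigma w - V\lambda.
$$

So $w_{\mathrm{optimal}} = \Sigma^{-1}V\lambda$. Now we know that $U = V'w_{\mathrm{optimal}}$. So $U = V'\Sigma^{-1}V\lambda = M\lambda$.
Therefore,

$$
w_{\mathrm{optimal}} = \Sigma^{-1} V M^{-1} U.
$$

We deduce immediately that

$$
w_{\mathrm{optimal}}'\Sigma w_{\mathrm{optimal}} = U'M^{-1}U. \qquad \square
$$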
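The closed form of Theorem 2.1 is straightforward to evaluate numerically. The following sketch is an illustration only: the helper name `solve_qp_eqc`, the dimensions and the randomly generated $\Sigma$, $V$ and $U$ are hypothetical choices of ours. It computes $w_{\mathrm{optimal}} = \Sigma^{-1}VM^{-1}U$ with linear solves rather than explicit inverses.

```python
import numpy as np

def solve_qp_eqc(sigma, V, U):
    """Minimize 0.5 * w' sigma w subject to V' w = U, using Theorem 2.1.

    Returns the optimal weights and the optimal value of w' sigma w.
    """
    sigma_inv_V = np.linalg.solve(sigma, V)   # Sigma^{-1} V, shape (p, k)
    M = V.T @ sigma_inv_V                     # M = V' Sigma^{-1} V, shape (k, k)
    lagrange = np.linalg.solve(M, U)          # M^{-1} U (the Lagrange multipliers)
    w_opt = sigma_inv_V @ lagrange            # Sigma^{-1} V M^{-1} U
    risk = U @ lagrange                       # U' M^{-1} U
    return w_opt, risk

# Hypothetical small example: p = 6 variables, k = 2 constraints.
rng = np.random.default_rng(0)
p = 6
B = rng.standard_normal((p, p))
sigma = B @ B.T + p * np.eye(p)               # positive definite by construction
V = np.column_stack([np.ones(p), rng.standard_normal(p)])
U = np.array([1.0, 0.5])

w_opt, risk = solve_qp_eqc(sigma, V, U)
print("constraint values V'w:", V.T @ w_opt)
print("w' Sigma w:", w_opt @ sigma @ w_opt, " U' M^{-1} U:", risk)
```

In the Markowitz case, the first column of `V` would be the vector $e$ and the second the vector of mean returns, with `U` holding 1 and the target return.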

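We now turn to another result which will prove to be useful later. It
gives a compact representation of linear combinations of the weights of the
optimal solution, and we will rely heavily on it in particular in the case of
Gaussian data.

Lemma 2.2. Let us consider $w_{\mathrm{optimal}}$, the solution of the optimization
problem (QP-eqc). Let $\gamma$ be a vector in $\mathbb{R}^p$. Let us call $\mathcal{M}$ the $(k + 1) \times (k + 1)$
matrix that is written in block form

$$
\mathcal{M} = \begin{pmatrix} V'\Sigma^{-1}V & V'\Sigma^{-1}\gamma \\ \gamma'\Sigma^{-1}V & \gamma'\Sigma^{-1}\gamma \end{pmatrix}.
$$

Assume that $\mathcal{M}$ is invertible. Then

$$
w_{\mathrm{optimal}}'\gamma = -\frac{1}{\mathcal{M}^{-1}(k+1, k+1)} \,(U' \;\; 0)\, \mathcal{M}^{-1} \begin{pmatrix} 0_k \\ 1 \end{pmatrix}. \tag{1}
$$

Proof. The proof is a consequence of the results discussed in the Appendix
concerning inverses of partitioned matrices [see Section A.1 and equation
(A.4) there]. Let us write

$$
\mathcal{M} = \begin{pmatrix} \mathcal{M}_{11} & \mathcal{M}_{12} \\ \mathcal{M}_{21} & \mathcal{M}_{22} \end{pmatrix},
$$

where $\mathcal{M}_{11}$ is $k \times k$, $\mathcal{M}_{12}$ is naturally $k \times 1$ and $\mathcal{M}_{22}$ is a scalar. With the
same block notation, we have

$$
\mathcal{M}^{-1} = \begin{pmatrix} \mathcal{M}^{11} & \mathcal{M}^{12} \\ \mathcal{M}^{21} & \mathcal{M}^{22} \end{pmatrix}.
$$

Then, we know [see equation (A.4)] that $\mathcal{M}^{12} = -\mathcal{M}_{11}^{-1}\mathcal{M}_{12}\mathcal{M}^{22}$, but since
$\mathcal{M}^{22}$ is a scalar, equal to $\mathcal{M}^{-1}(k + 1, k + 1)$, we have

$$
\mathcal{M}_{11}^{-1}\mathcal{M}_{12} = -\frac{\mathcal{M}^{12}}{\mathcal{M}^{22}}.
$$

Now $\mathcal{M}_{11}^{-1}\mathcal{M}_{12} = (V'\Sigma^{-1}V)^{-1}V'\Sigma^{-1}\gamma$, so $U'\mathcal{M}_{11}^{-1}\mathcal{M}_{12} = w_{\mathrm{optimal}}'\gamma$. Hence,

$$
w_{\mathrm{optimal}}'\gamma = -\frac{1}{\mathcal{M}^{22}} \,(U' \;\; 0)\, \mathcal{M}^{-1} \begin{pmatrix} 0_k \\ 1 \end{pmatrix}. \qquad \square
$$

We note that here $(\mathcal{M}^{22})^{-1} = \gamma'\Sigma^{-1}\gamma - \gamma'\Sigma^{-1}VM^{-1}V'\Sigma^{-1}\gamma$, as an
application of equation (A.2) clearly shows.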
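The identity of Lemma 2.2 can be checked numerically against a direct computation of the optimal weights. The sketch below is a sanity check under hypothetical inputs (random $\Sigma$, $V$, $U$ and $\gamma$ of our choosing); the function names are ours.

```python
import numpy as np

def optimal_weights(sigma, V, U):
    """Direct formula of Theorem 2.1: Sigma^{-1} V M^{-1} U."""
    sigma_inv_V = np.linalg.solve(sigma, V)
    M = V.T @ sigma_inv_V
    return sigma_inv_V @ np.linalg.solve(M, U)

def functional_via_block_inverse(sigma, V, U, gamma):
    """gamma' w_optimal computed from the inverse of the (k+1) x (k+1) block matrix."""
    k = V.shape[1]
    V_gamma = np.column_stack([V, gamma])                  # p x (k+1)
    M_cal = V_gamma.T @ np.linalg.solve(sigma, V_gamma)    # block matrix of Lemma 2.2
    M_cal_inv = np.linalg.inv(M_cal)
    U_padded = np.append(U, 0.0)                           # the row vector (U' 0)
    # (U' 0) M_cal^{-1} (0_k, 1)' picks the last column of the inverse.
    return -(U_padded @ M_cal_inv[:, k]) / M_cal_inv[k, k]

rng = np.random.default_rng(0)
p, k = 5, 2
B = rng.standard_normal((p, p))
sigma = B @ B.T + p * np.eye(p)
V = rng.standard_normal((p, k))
U = rng.standard_normal(k)
gamma = rng.standard_normal(p)

direct = gamma @ optimal_weights(sigma, V, U)
via_block = functional_via_block_inverse(sigma, V, U, gamma)
print(direct, via_block)
```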

3. QP with equality constraints: Impact of parameter estimation in the
Gaussian case. From now on, we will assume that we are in the high-dimensional
setting where p and n go to infinity. Our study will be divided
into two parts. We will first consider the Gaussian setting (in this section) and
then study an elliptical distribution setting (in Section 4). (We note that for
the Markowitz problem, the assumption of Gaussianity would be satisfied
if we worked under Black-Scholes diffusion assumptions for our assets and
were considering log-returns as our observations.) Interestingly, we will show
that the results are not robust against the assumption of Gaussianity, which
is not (so) surprising in light of recent random matrix results [see El Karoui
(2009a)]. We will also show that understanding the elliptical setting allows
us to understand the impact of correlation between observations and to
discuss bootstrap-related ideas. In particular, we will see that various problems
arise with the bootstrap in high dimension and that the results change
depending on whether or not the observations are correlated (in time).

We also address similar questions concerning inequality constrained prob- 
lems in Section 5.6. 

Before we proceed, we need to set up some notation: we call $e$ the
$p$-dimensional vector whose entries are all equal to 1. We call $V$, as above,
the matrix containing all of our constraint vectors, which we may have to
estimate (for instance, if $v_i = \mu$ for a certain $i$). We call $\hat V$ the matrix of
estimated constraint vectors.

The template question for all our investigations will be the following
(Markowitz) question: what can be said of the statistical properties of the
solution of

$$
\begin{cases}
\min w'\hat\Sigma w,\\
w'\hat\mu = \mu_P,\\
w'e = 1
\end{cases}
$$

compared to the solution of the population version

$$
\begin{cases}
\min w'\Sigma w,\\
w'\mu = \mu_P,\\
w'e = 1?
\end{cases}
$$

We will solve the problem at a much greater degree of generality, by
considering first quadratic programs with linear equality constraints (see
Section 5.6 for inequality constraints) and comparing the solutions of

$$
\text{(QP-eqc-Emp)}\qquad
\begin{cases}
\min_{w \in \mathbb{R}^p} w'\hat\Sigma w,\\
w'v_i = u_i, \quad 1 \le i \le k - 1,\\
w'\hat\mu = u_k
\end{cases}
$$

and

$$
\text{(QP-eqc-Pop)}\qquad
\begin{cases}
\min_{w \in \mathbb{R}^p} w'\Sigma w,\\
w'v_i = u_i, \quad 1 \le i \le k - 1,\\
w'\mu = u_k.
\end{cases}
$$

Here $\mu$ and $\Sigma$ will be estimated from the data. We call $w_{\mathrm{emp}}$ the vector that
yields a solution of problem (QP-eqc-Emp) and $w_{\mathrm{theo}}$ the vector that yields
a solution of problem (QP-eqc-Pop).

We call $\hat V$ the $p \times k$ matrix containing $\{v_i\}_{i=1}^{k-1}$ and $\hat\mu$, and $V$ its population
counterpart, which contains $\{v_i\}_{i=1}^{k-1}$ and $\mu$. We assume that $\{v_i\}_{i=1}^{k-1}$ are
deterministic and known (just like the vector $e$ in the Markowitz problem).
In our analysis, $k$ will be held fixed. (The $k$th column of $\hat V$ will contain $\hat\mu$ in
general or our estimator of $\mu$.)

As should be clear from Theorem 2.1, the properties of the entries of the
matrix $\hat V'\hat\Sigma^{-1}\hat V$ as compared to those of the matrix $V'\Sigma^{-1}V$ will be key
to our understanding of this question. In what follows, we assume that the
vectors $v_i$ are either deterministic or equal to $\mu$. The extension to linear
combinations of a deterministic vector and $\mu$ is straightforward. We also
note that in the Gaussian case, we could just assume that the $\hat v_i$ are
(deterministic) functions of $\hat\mu$ (because $\hat\mu$ and $\hat\Sigma$ are independent in this case). On
the other hand, the vector $U$ is assumed to be deterministic.

Before we proceed, let us mention that after our study was completed, we 
learned of similar results (restricted to the Markowitz case and not dealing 
with general quadratic programs with linear equality constraints) by Kan 
and Smith (2008). We stress the fact that our work was independent of theirs 
and is more general which is why it is included in the paper. 

3.1. Efficient frontier problems. We first study questions concerning the 
efficient frontier and then turn to information we can get about linear func- 
tionals of the empirical weights. 

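Theorem 3.1. Let us assume that we observe data $X_i$, i.i.d. $\mathcal{N}(\mu, \Sigma)$, for
$i = 1, \ldots, n$. Here $\Sigma$ is $p \times p$ and $p < n$. Suppose we estimate $\Sigma$ with the
sample covariance matrix $\hat\Sigma$, and $\mu$ with the sample mean $\hat\mu$. Suppose we
wish to solve the problem

$$
\text{(QP-eqc-Pop)}\qquad
\begin{cases}
\min_{w \in \mathbb{R}^p} w'\Sigma w,\\
w'v_i = u_i, \quad 1 \le i \le k,
\end{cases}
$$

where $u_j$ are deterministic, $v_j$ are deterministic and given for $j < k$ and
$v_k = \mu$. Assume that we use as a proxy for the previous problem the empirical
version with plugged-in parameters. Let us consider the solution of the
problem

$$
\text{(QP-eqc-Emp)}\qquad
\begin{cases}
\min_{w \in \mathbb{R}^p} w'\hat\Sigma w,\\
w'\hat v_i = u_i, \quad 1 \le i \le k.
\end{cases}
$$

Now $\hat v_j = v_j$ for $j < k$ and $\hat v_k = f(\hat\mu)$, for a given deterministic function $f$.
Let us call $w_{\mathrm{emp}}$ the corresponding "weight" vector. The plug-in estimate of
$w_{\mathrm{theo}}'\Sigma w_{\mathrm{theo}}$ is $w_{\mathrm{emp}}'\hat\Sigma w_{\mathrm{emp}}$. Let us call $w_{\mathrm{oracle}}$ the optimal solution of the quadratic
program obtained under the assumption that $\Sigma$ is given, but $\mu$ is not and is
estimated by $f(\hat\mu)$. Finally, we assume that $n - 1 - p + k > 0$.
Then we have

$$
w_{\mathrm{emp}}'\hat\Sigma w_{\mathrm{emp}} = w_{\mathrm{oracle}}'\Sigma w_{\mathrm{oracle}} \, \frac{\chi^2_{n-1-p+k}}{n-1}, \tag{2}
$$

where $w_{\mathrm{oracle}}'\Sigma w_{\mathrm{oracle}}$ is random (because $\hat\mu$ is) but is statistically independent
of $\chi^2_{n-1-p+k}$. Also,

$$
E\left(w_{\mathrm{emp}}'\hat\Sigma w_{\mathrm{emp}}\right) = \left(1 - \frac{p-k}{n-1}\right) E\left(w_{\mathrm{oracle}}'\Sigma w_{\mathrm{oracle}}\right).
$$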

The previous theorem means that the cost of not knowing the covariance
matrix and estimating it is the appearance of the term $\chi^2_{n-1-p+k}/(n-1)$. In the
high-dimensional setting when p and n are of the same order of magnitude and
n - p is large, this term is approximately $1 - (p - k)/(n - 1)$. Hence, the theorem
quantifies the random matrix intuition that having to estimate the
high-dimensional covariance matrix at stake here leads to risk underestimation,
by the factor $1 - (p - k)/(n - 1)$. In other words, using plug-in procedures
leads to over-optimistic conclusions in this situation.

We also note that the previous theorem shows that, in the Gaussian
setting under study here, the effects of estimating the mean and the covariance
on the solution of the quadratic program are "separable": the effect of the
mean estimation is in the oracle term, while the effect of estimating the
covariance is in the $\chi^2_{n-p-1+k}/(n-1)$ term. To show risk underestimation,
it will therefore be necessary to relate $w_{\mathrm{oracle}}'\Sigma w_{\mathrm{oracle}}$ to $w_{\mathrm{theo}}'\Sigma w_{\mathrm{theo}}$. We do
it in Proposition 3.2 but first give a proof of Theorem 3.1.
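The separation described by Theorem 3.1 lends itself to a quick Monte Carlo check. The sketch below is an illustration under assumptions of ours: a Markowitz-type problem with k = 2 constraints ($\hat v_1 = e$, $\hat v_2 = \hat\mu$, that is $f(\hat\mu) = \hat\mu$), a hypothetical diagonal $\Sigma$ and mean vector, and 2000 replications. It compares the average of the ratio between the plug-in risk and the oracle risk with $(n - 1 - p + k)/(n - 1)$, the mean of $\chi^2_{n-1-p+k}/(n-1)$.

```python
import numpy as np

def risk_from_constraints(cov, V, U):
    """U' (V' cov^{-1} V)^{-1} U, i.e. the optimal value of w' cov w (Theorem 2.1)."""
    M = V.T @ np.linalg.solve(cov, V)
    return U @ np.linalg.solve(M, U)

rng = np.random.default_rng(0)                 # arbitrary seed
n, p, k = 60, 20, 2
sigma_diag = np.linspace(0.5, 2.0, p)          # hypothetical population variances
sigma = np.diag(sigma_diag)
mu = np.linspace(0.05, 0.15, p)                # hypothetical population means
e = np.ones(p)
U = np.array([1.0, 0.1])                       # w'e = 1 and target return 0.1

ratios = []
for _ in range(2000):
    X = mu + rng.standard_normal((n, p)) * np.sqrt(sigma_diag)
    mu_hat = X.mean(axis=0)
    sigma_hat = np.cov(X, rowvar=False)        # sample covariance, divides by n - 1
    V_hat = np.column_stack([e, mu_hat])
    empirical = risk_from_constraints(sigma_hat, V_hat, U)   # plug-in risk
    oracle = risk_from_constraints(sigma, V_hat, U)          # Sigma known, mu estimated
    ratios.append(empirical / oracle)

ratios = np.array(ratios)
predicted_mean = (n - 1 - p + k) / (n - 1)
print("average ratio:", ratios.mean(), " predicted:", predicted_mean)
```

With these values, the plug-in risk is on average only about 70% of the oracle risk, in line with the factor $1 - (p - k)/(n - 1)$.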

Proof of Theorem 3.1. The crux of the proof is the following result,
which is well known by statisticians, concerning (essentially) blocks
of the inverse of a Wishart matrix: if $S \sim \mathcal{W}_p(\Sigma, m)$, that is, $S$ is a $p \times p$
Wishart matrix with $m$ degrees of freedom and covariance $\Sigma$, and $A$ is a $p \times k$
deterministic matrix, then, when $m > p$,

$$
(A'S^{-1}A)^{-1} \sim \mathcal{W}_k\left((A'\Sigma^{-1}A)^{-1}, m - p + k\right).
$$

We refer to Eaton [(1983), Proposition 8.9, page 312] for a proof, and to
Mardia, Kent and Bibby [(1979), pages 70-73] for related results.

Another important remark is the well-known fact that, in the situation
we are considering, $\hat\mu$ is $\mathcal{N}(\mu, \Sigma/n)$ and independent of $\hat\Sigma$. Finally, it is also
well known that if $S \sim \mathcal{W}_p(\Sigma, m)$ and $U$ is a $p$-dimensional deterministic
vector, then $U'SU = U'\Sigma U \, \chi^2_m$.

Now $\hat\Sigma \sim \mathcal{W}_p(\Sigma, n - 1)/(n - 1)$. Therefore, since $\hat V$ is a function of $\hat\mu$, we
have, by independence of $\hat\mu$ and $\hat\Sigma$,

$$
(\hat V'\hat\Sigma^{-1}\hat V)^{-1} \mid \hat\mu \sim \mathcal{W}_k\left((\hat V'\Sigma^{-1}\hat V)^{-1}, n - 1 - p + k\right)/(n - 1).
$$

Therefore,

$$
\left.\frac{U'(\hat V'\hat\Sigma^{-1}\hat V)^{-1}U}{U'(\hat V'\Sigma^{-1}\hat V)^{-1}U} \,\right|\, \hat\mu \sim \frac{\chi^2_{n-p-1+k}}{n-1}.
$$

Because the right-hand side does not depend on $\hat\mu$, this ratio, which is
distributed as $\chi^2_{n-p-1+k}/(n-1)$, is independent of $\hat\mu$ and hence of
$U'(\hat V'\Sigma^{-1}\hat V)^{-1}U$. Hence, we conclude that

$$
U'(\hat V'\hat\Sigma^{-1}\hat V)^{-1}U = U'(\hat V'\Sigma^{-1}\hat V)^{-1}U \, \frac{\chi^2_{n-p-1+k}}{n-1},
$$

and the two terms are independent. Now the term $U'(\hat V'\Sigma^{-1}\hat V)^{-1}U$ is the
estimate we would get for the solution of problem (QP-eqc-Pop), if $\Sigma$ were
known and $\mu$ were estimated by $f(\hat\mu)$. In other words, it is the "oracle"
solution described above. □

3.1.1. Some remarks on the oracle solution. Theorem 3.1 sheds light on 
the separate effects of mean and covariance estimation on the problem con- 
sidered above. To understand further the problem of risk estimation, we 
need to better understand the role the estimation of the mean might play. 
This is what we do now. 

Proposition 3.2. Suppose that the last column of $\hat V$ is $\hat\mu$. Let us call
$V_{-k}$ the $p \times (k - 1)$ dimensional matrix whose $j$th column is $v_j$, which are
known deterministic vectors. Suppose that $M = V'\Sigma^{-1}V = O(1)$. Suppose
further that $\lambda_k(V'\Sigma^{-1}V) \gg n^{-1/2}$, where $\lambda_k(S)$ is the smallest eigenvalue of
the $k \times k$ matrix $S$.

Further, recall that $M = V'\Sigma^{-1}V \in \mathbb{R}^{k \times k}$ and call $e_k$ the $k$th canonical basis
vector in $\mathbb{R}^k$. Finally, call $\alpha = \chi^2_p/n$.

Then, when $p/n \to \rho \in (0, 1)$, asymptotically,

$$
w_{\mathrm{oracle}}'\Sigma w_{\mathrm{oracle}} = w_{\mathrm{theo}}'\Sigma w_{\mathrm{theo}} - \alpha \frac{(U'M^{-1}e_k)^2}{1 + \alpha e_k'M^{-1}e_k} + o_P\left(w_{\mathrm{theo}}'\Sigma w_{\mathrm{theo}}\right).
$$

Let us discuss this result a little before we provide a proof. In the
asymptotics we have in mind and are considering, $p/n \to \rho \in (0, 1)$ and
therefore $\alpha = p/n + O_P(n^{-1/2})$. So if $\delta_n = (U'M^{-1}e_k)^2/(1 + (p/n) e_k'M^{-1}e_k)$,
when the above analysis applies, the impact of the estimation of $\mu$ by $\hat\mu$ will
be risk underestimation, just as is the case for the covariance
matrix. Here, we can also quantify the impact of this estimation of $\mu$ by $\hat\mu$:
it leads to risk underestimation by the amount $\alpha\delta_n$.

Proof of Proposition 3.2. Let us write $\hat\mu = \mu + e$, where $e \sim \mathcal{N}(0, \Sigma/n)$.
Clearly, $e = n^{-1/2}\Sigma^{1/2}Z$, where $Z$ is $\mathcal{N}(0, \mathrm{Id}_p)$. We have, using block
notations,

$$
\hat V'\Sigma^{-1}\hat V = V'\Sigma^{-1}V + \begin{pmatrix} 0 & V_{-k}'\Sigma^{-1}e \\ e'\Sigma^{-1}V_{-k} & 2\mu'\Sigma^{-1}e + e'\Sigma^{-1}e \end{pmatrix}.
$$

Replacing $e$ by its value, we have $\mu'\Sigma^{-1}e \sim \mathcal{N}(0, \mu'\Sigma^{-1}\mu/n)$. By the same
token, we can also get that

$$
V_{-k}'\Sigma^{-1}e = \frac{1}{\sqrt{n}} V_{-k}'\Sigma^{-1/2}Z \sim \mathcal{N}\left(0, \frac{V_{-k}'\Sigma^{-1}V_{-k}}{n}\right).
$$

Our assumption that $V'\Sigma^{-1}V = O(1)$ implies that $\mu'\Sigma^{-1}\mu = O(1)$ and
$V_{-k}'\Sigma^{-1}V_{-k} = O(1)$. Therefore,

$$
\begin{pmatrix} 0 & V_{-k}'\Sigma^{-1}e \\ e'\Sigma^{-1}V_{-k} & 2\mu'\Sigma^{-1}e \end{pmatrix} = O_P\left(\frac{1}{\sqrt{n}}\right).
$$

Hence, since $e'\Sigma^{-1}e = Z'Z/n = \alpha$,

$$
\hat V'\Sigma^{-1}\hat V = V'\Sigma^{-1}V + \alpha e_k e_k' + O_P(n^{-1/2}).
$$

Our assumptions guarantee that $\lambda_k(V'\Sigma^{-1}V) \gg n^{-1/2}$, and therefore
$\lambda_k(V'\Sigma^{-1}V + \alpha e_k e_k') \gg n^{-1/2}$. In other respects, let $A$ be a matrix such
that $\lambda_p(A) \gg n^{-1/2}$ and $E$ be a matrix such that $E = O(n^{-1/2})$. Recall that
for symmetric matrices, $\lambda_p(A + E) \ge \lambda_p(A) + \lambda_p(E)$ [see, e.g., Weyl's theorem,
Horn and Johnson (1994), page 185]. So in this situation, $(A + E)^{-1} = o(n^{1/2})$.
Let us now consider the implications of this remark on the difference
of $(A + E)^{-1}$ and $A^{-1}$. We claim that $(A + E)^{-1} = A^{-1} + o(A^{-1})$. By the
first resolvent identity, $(A + E)^{-1} = A^{-1} - (A + E)^{-1}EA^{-1}$; our previous
remark implies that $\sigma_1[(A + E)^{-1}E] = o(1)$ and the result follows. Applying
the results of this discussion to $A = V'\Sigma^{-1}V + \alpha e_k e_k'$ and $A + E = \hat V'\Sigma^{-1}\hat V$,
we have

$$
(\hat V'\Sigma^{-1}\hat V)^{-1} = (V'\Sigma^{-1}V + \alpha e_k e_k')^{-1} + o_P\left((V'\Sigma^{-1}V + \alpha e_k e_k')^{-1}\right).
$$

We can now use well-known results concerning inverses of rank-1 perturbations
of matrices, namely

$$
(V'\Sigma^{-1}V + \alpha e_k e_k')^{-1} = (M + \alpha e_k e_k')^{-1} = M^{-1} - \alpha \frac{M^{-1}e_k e_k'M^{-1}}{1 + \alpha e_k'M^{-1}e_k}.
$$

This allows us to conclude that

$$
U'(\hat V'\Sigma^{-1}\hat V)^{-1}U = U'M^{-1}U - \alpha \frac{(U'M^{-1}e_k)^2}{1 + \alpha e_k'M^{-1}e_k} + o_P(U'M^{-1}U).
$$

This is the result announced in the proposition and the proof is complete. □
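The rank-one inversion step used at the end of the proof can be verified numerically. The sketch below uses a hypothetical positive definite $M$, vector $U$ and value of $\alpha$, all chosen by us for illustration. It checks the rank-one update formula and the resulting expression for $U'(M + \alpha e_k e_k')^{-1}U$.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3
B = rng.standard_normal((k, k))
M = B @ B.T + k * np.eye(k)          # hypothetical symmetric positive definite M
U = rng.standard_normal(k)
alpha = 0.4                          # plays the role of chi^2_p / n, close to p/n
e_k = np.zeros(k)
e_k[-1] = 1.0                        # k-th canonical basis vector

M_inv = np.linalg.inv(M)
denominator = 1.0 + alpha * (e_k @ M_inv @ e_k)

# Rank-one update of the inverse versus a direct inversion.
direct_inverse = np.linalg.inv(M + alpha * np.outer(e_k, e_k))
rank_one_formula = M_inv - alpha * np.outer(M_inv @ e_k, e_k @ M_inv) / denominator

# Consequence for the quadratic form: leading terms of the oracle risk.
theoretical_risk = U @ M_inv @ U
correction = alpha * (U @ M_inv @ e_k) ** 2 / denominator
oracle_risk_leading_term = U @ direct_inverse @ U

print(theoretical_risk, correction, oracle_risk_leading_term)
```

Since the correction is nonnegative, the leading term of the oracle risk never exceeds the theoretical risk $U'M^{-1}U$.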

We can now combine the results of Theorem 3.1 and Proposition 3.2 to 
obtain the following corollary. 

Corollary 3.3. We assume that the assumptions of Theorem 3.1 and
Proposition 3.2 hold and that $p/n$ has a finite nonzero limit, as $n \to \infty$, and
$n - p$ tends to infinity. Then we have

\[
w_{\mathrm{emp}}'\hat\Sigma w_{\mathrm{emp}} = \Bigl(1 - \frac{p-k}{n-1}\Bigr)\Bigl(w_{\mathrm{theo}}'\Sigma w_{\mathrm{theo}} - \frac{p}{n}\,\frac{(U'M^{-1}e_k)^2}{1 + (p/n)\,e_k'M^{-1}e_k}\Bigr) + o_P\bigl(w_{\mathrm{theo}}'\Sigma w_{\mathrm{theo}} \vee n^{-1/2}\bigr),
\tag{3}
\]

where $M$ is the population quantity $M = V'\Sigma^{-1}V$.

The corollary shows that the effects of both covariance and mean estima- 
tion are to underestimate the risk, and the empirical frontier is asymptoti- 
cally deterministic. 

3.2. On the optimal weights. Our matrix characterization of the empiri- 
cal optimal weights (Lemma 2.2) allows us to give a precise characterization 
of the statistical properties of linear functionals of these weights. We give 
here some exact results, concerning distributions and expectations of those 
functionals. A longer discussion, including robustness and more detailed bias 
issues can be found in Section 5. 

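Proposition 3.4. Assume that the assumptions of Theorem 3.1 hold
and in particular $X_i$ are i.i.d. $\mathcal{N}(\mu, \Sigma_p)$. Let $\gamma$ be a fixed $p$-dimensional
vector. Let us call $\hat V_\gamma = (\hat V\ \gamma)$ the $p \times (k+1)$ matrix whose first $k$ columns
are those of $\hat V$. Let $\hat N_\gamma = (\hat V_\gamma'\Sigma^{-1}\hat V_\gamma)^{-1}$ and $W_\gamma$ be a $(k+1) \times (k+1)$ matrix
with distribution $\mathcal{W}_{k+1}(\hat N_\gamma, n-p+k)$ (conditional on $\hat\mu$). Then,

\[
\gamma' w_{\mathrm{emp}} \mid \hat\mu \;\overset{\mathcal{L}}{=}\; -\frac{(U'\ 0)\,W_\gamma e_{k+1}}{W_\gamma(k+1,k+1)}.
\]

In particular,

\[
E(\gamma' w_{\mathrm{emp}} \mid \hat\mu) = -\frac{(U'\ 0)\,\hat N_\gamma e_{k+1}}{\hat N_\gamma(k+1,k+1)}.
\]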


We note, somewhat heuristically, that when $\mu$ is estimated by $\hat\mu$, since
$\hat\mu \sim \mathcal{N}(\mu, \Sigma/n)$, $\hat\mu'\Sigma^{-1}\hat\mu \simeq \mu'\Sigma^{-1}\mu + p/n$, when $p$, $n$ and $n-p$ are all large
(we refer again to Section 5 for a more precise statement). Hence $\hat N_\gamma$ is
not a consistent estimator of $N_\gamma = (V_\gamma'\Sigma^{-1}V_\gamma)^{-1}$. As we will see in Section 5.2
and as can be expected from the previous proposition, this will also imply
bias for linear combinations of empirical optimal weights. We will show in
particular that returns are overestimated when using $\hat\mu$ as an estimator for $\mu$.

Another interesting aspect of the previous proposition is that it allows us
to understand the fluctuation behavior of $\gamma' w_{\mathrm{emp}}$ when $n-p+k$ is large: as
a matter of fact, the limiting fluctuation behavior of the entries of a (fixed-dimensional)
Wishart matrix with large number of degrees of freedom is
well known [see, e.g., Anderson (2003), Theorem 3.4.4, page 87] and the
$\delta$-method can be applied to get the information, conditional on $\hat\mu$.

For instance, if we assume that, conditional on $\hat\mu$, the matrix $\hat N_\gamma$ converges
to a matrix $\tilde N_\gamma$, which possibly depends on $\hat\mu$, we see that calling $\nu$ the
last column of $W_\gamma/(n-p+k)$, $\nu$ is asymptotically normal (all statements
are conditional on $\hat\mu$), if $n-p+k$ goes to infinity when $p$ and $n$ go to
infinity. Furthermore we know the limiting covariance of $\nu$ (after scaling by
$\sqrt{n-p+k}$), using Theorem 3.4.4 in Anderson (2003). Let us call it $\Gamma_0$ and
let us call $\nu_0$ the limit of $\nu$, which we assume exists.

If we assume that $\nu_0(k+1)$ is not 0, Slutsky's lemma and the $\delta$-method
give us through simple computations that

\[
\sqrt{n-p+k}\,\Bigl(\gamma' w_{\mathrm{emp}} + \frac{(U'\ 0)\,\nu_0}{\nu_0(k+1)}\Bigr)\Bigm|\hat\mu \;\Longrightarrow\; \mathcal{N}\Bigl(0, \frac{C'\Gamma_0 C}{\nu_0(k+1)^4}\Bigr),
\]

where $C = \nu_0(k+1)\begin{pmatrix} U \\ 0 \end{pmatrix} - \bigl((U'\ 0)\,\nu_0\bigr)e_{k+1}$.

We know the distribution of $\hat\mu$, so we could get (limiting) unconditional
results for $\gamma' w_{\mathrm{emp}}$. This is not hard but a bit tedious if we want explicit
expressions, and because our focus is mostly on first-order properties in this
paper, we do not state the result.

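Proof of Proposition 3.4. The proof follows from the representation
we gave in Lemma 2.2, that is,

\[
\gamma' w_{\mathrm{emp}} = -\frac{1}{(\hat V_\gamma'\hat\Sigma^{-1}\hat V_\gamma)^{-1}(k+1,k+1)}\,(U'\ 0)\,(\hat V_\gamma'\hat\Sigma^{-1}\hat V_\gamma)^{-1}\begin{pmatrix} 0_k \\ 1 \end{pmatrix},
\]

and the fact that, by the same arguments as before, conditional on $\hat\mu$,

\[
(\hat V_\gamma'\hat\Sigma^{-1}\hat V_\gamma)^{-1} \mid \hat\mu \sim \mathcal{W}_{k+1}\bigl((\hat V_\gamma'\Sigma^{-1}\hat V_\gamma)^{-1}, n-p+k\bigr)/(n-1).
\]

We conclude that

\[
\gamma' w_{\mathrm{emp}} \mid \hat\mu \;\overset{\mathcal{L}}{=}\; -\frac{(U'\ 0)\,W_\gamma \begin{pmatrix} 0_k \\ 1 \end{pmatrix}}{W_\gamma(k+1,k+1)} = -\frac{(U'\ 0)\,W_\gamma e_{k+1}}{W_\gamma(k+1,k+1)}.
\]

This shows the first part of the proposition.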

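The second part follows from the following observation. Suppose the matrix
$P$ is $\mathcal{W}_p(\mathrm{Id}_p, K)$. If $\alpha$ and $\beta$ are $p$-dimensional, orthogonal vectors, let
us consider

\[
\frac{\alpha' P \beta}{\beta' P \beta}.
\]

We can, of course, write $P = \sum_{i=1}^{K} Y_i Y_i'$ where $Y_i$ are i.i.d. $\mathcal{N}(0, \mathrm{Id}_p)$. In other
respects, $Y_i'\alpha$ and $Y_i'\beta$ are clearly independent normal random variables,
since their covariance is $\alpha'\beta = 0$, and they are normal. So

\[
E\Bigl(\frac{\alpha' P \beta}{\beta' P \beta}\Bigm| \{Y_i'\beta\}_{i=1}^{K}\Bigr) = 0
\]

because the quantity whose expectation we are taking is a linear combination
of mean 0 independent normal random variables. Hence, also,

\[
E\Bigl(\frac{\alpha' P \beta}{\beta' P \beta}\Bigr) = 0.
\]

Now, when $\alpha$ is not orthogonal to $\beta$, we write $\alpha = \beta(\alpha'\beta)/\|\beta\|_2^2 + \delta$, where
$\delta$ is orthogonal to $\beta$. We immediately deduce that in general,

\[
E\Bigl(\frac{\alpha' P \beta}{\beta' P \beta}\Bigr) = \frac{\alpha'\beta}{\|\beta\|_2^2}\,E\Bigl(\frac{\beta' P \beta}{\beta' P \beta}\Bigr) = \frac{\alpha'\beta}{\|\beta\|_2^2}.
\]

Furthermore, when $P$ is $\mathcal{W}_p(\Sigma, K)$, because we can write $P = \Sigma^{1/2}P_0\Sigma^{1/2}$,
where $P_0 \sim \mathcal{W}_p(\mathrm{Id}_p, K)$, we finally have

\[
E\Bigl(\frac{\alpha' P \beta}{\beta' P \beta}\Bigr) = \frac{\alpha'\Sigma\beta}{\beta'\Sigma\beta}.
\]

In the case of interest to us, we have $\alpha = \begin{pmatrix} U \\ 0 \end{pmatrix}$, $\beta = e_{k+1}$ and $\Sigma = \hat N_\gamma$.
Applying the previous formula gives us the second part of the proposition.
□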



We now turn to the question of understanding the robustness properties 
of the Gaussian results we just obtained. We will do so by studying the same 
problems under more general distributional assumptions, and specifically we 
will now assume that the observations are elliptically distributed. 

4. Solutions of quadratic programs when the data is elliptically distributed. 

In Section 3, we studied the properties of the "plug-in" solution of problem 
(QP-eqc-Pop) under the assumption that the data was normally distributed. 
While this allowed us to shed light on the statistical properties of the so- 
lution of problem (QP-eqc-Emp), it is naturally extremely important to 
understand how robust the results are to our normality assumptions. 

In this section, we will consider elliptical models, that is, models such that
the data can be expressed as

\[
X_i = \mu + \lambda_i \Sigma^{1/2} Y_i,
\]

where $\lambda_i$ is a random variable and $Y_i$ are i.i.d. $\mathcal{N}(0, \mathrm{Id}_p)$ vectors. $\lambda_i$ and $Y_i$
are assumed to be independent, and to lift the indeterminacy between $\Sigma$
and $\lambda_i$, we assume that $E(\lambda_i^2) = 1$. Under this assumption, we clearly have
$\mathrm{cov}(X_i) = \Sigma$. We note that this is not the standard definition of elliptical
models, which generally replaces $Y_i$ with a vector uniformly distributed on
the sphere in $\mathbb{R}^p$, but it captures the essence of the problem. We refer the
interested reader to Anderson (2003) and Fang, Kotz and Ng (1990) for
extensive discussions of elliptical distributions.






Our motivation for undertaking this study comes also from the fact that 
for certain types of data, such as financial data, it is sometimes argued that 
elliptical models are more reasonable than Gaussian ones, for instance, be- 
cause they can capture nontrivial tail dependence [see Frahm and Jaekel 
(2005) where such models are advocated for high-dimensional modelization 
of financial returns, Meucci (2005) for a discussion of their relevance for cer- 
tain financial markets, Biroli, Bouchaud and Potters (2007) for modelization 
considerations quite similar to Frahm and Jaekel (2005) and McNeil, Frey 
and Embrechts (2005) for a thorough discussion of tail dependence]. From a 
theoretical standpoint, considering elliptical models will also help in several 
other ways: the results will yield alternative proofs to some of the results we 
obtained in the Gaussian case, they will allow us to deal with some situa- 
tions where the data Xi are not independent and they will also allow us to 
understand the properties of the bootstrap. 

We also want to point out that elliptical distributions allow us to not fall 
into the geometric "trap" of standard random matrix models highlighted 
in El Karoui (2009a): the fact that data vectors drawn from standard ran- 
dom matrix models are essentially assumed to be almost orthogonal to one 
another and that their norm (after renormalization by $1/\sqrt{p}$) is almost constant.
In a sense, studying elliptical models will allow us to understand what
is the impact of the implicit geometric assumptions made about the data 
when assuming normality. (We purposely do so not under minimal assump- 
tions but under assumptions that capture the essence of the problem while 
allowing us to show in the proofs the key stochastic phenomena at play.) 
This part of the article can therefore be viewed as a continuation of the 
investigation we started in El Karoui (2009a) where we showed a lack of ro- 
bustness of random matrix models (contradicting claims of "universality") 
by thoroughly investigating limiting spectral distribution properties of high- 
dimensional covariance matrices when the data is drawn according to ellipti- 
cal models and generalizations. We show here that the theoretical problems 
we highlighted in El Karoui (2009a) have important practical consequences. 
[For more references on elliptical models in a random matrix context, we 
refer the reader to El Karoui (2009a) where an extended bibliography can 
be found.] 

We now turn to the problem of understanding the solution of problem 
(QP-eqc-Emp) in the setting where the data is elliptically distributed. We 
will limit ourselves to the case where the matrix V is full of known and 
deterministic vectors, except possibly for the sample mean. In this section 
we restrict ourselves to convergence in probability results. It is clear from 
Section 2 that to tackle the problems we are considering we need to understand
at least three types of quantities: $v'\hat\Sigma^{-1}v$ for a deterministic $v$ with
unit norm, $\hat\mu'\hat\Sigma^{-1}v$ and $\hat\mu'\hat\Sigma^{-1}\hat\mu$.

Here is a brief overview of our findings. When we consider elliptical models,
our results say that roughly speaking, under certain assumptions given
precisely later:

1. $v'\hat\Sigma^{-1}v / v'\Sigma^{-1}v \to s$, where $s$ satisfies, if $G$ is the limit law of the empirical
distribution of the $\lambda_i^2$ and $p/n \to \rho \in (0,1)$, $\int \frac{dG(\tau)}{1+\rho\tau s} = 1 - \rho$.

2. If $\mu = 0$, $\hat\mu'\hat\Sigma^{-1}\hat\mu \to \rho/(1-\rho)$.

3. If $\mu = 0$, $\hat\mu'\hat\Sigma^{-1}v \to 0$.

All these convergence results are to be understood in probability. They naturally
allow us, under certain conditions on the population parameters, to
conclude about the convergence in probability of the matrix $\hat V'\hat\Sigma^{-1}\hat V$. The
results mentioned above are stated in all details in Theorems 4.1 and 4.6.

In the situation where the $\lambda_i$ are i.i.d., the results above hold when the $\lambda_i$ have a
second moment and they do not put too much mass near 0. This is interesting
in practice because it tells us that our results hold for heavy-tailed data,
which are of particular interest in some financial applications.

The bootstrap situation corresponds basically to $G$ being Poisson(1),
which we denote by Po(1). Also in the statement above for $\hat\mu'\hat\Sigma^{-1}\hat\mu$, one
should replace $\rho/(1-\rho)$ by $s - 1$ in the bootstrap case. This is explained in
Theorem 4.12 and Section 4.4.4. Finally, in the case of Gaussian data with
"temporal" correlation, that is, when the data can be written in matrix form
$X = e_n\mu' + \Lambda Y\Sigma^{1/2}$, where $\Lambda$ is not diagonal (and $e_n$ is an $n$-dimensional
vector with only 1's in its entries), one should replace $G$ by the limiting
spectral distribution of $\Lambda'\Lambda$. The question of convergence of $\hat\mu'\hat\Sigma^{-1}\hat\mu$ is then
more involved. We refer to Proposition 4.8 for details about this situation.

Though we are taking a fundamentally random matrix theoretic approach, 
our presentation purposely avoids borrowing too many techniques from random
matrix theory in the hope of making clear(er) the phenomena that yield
the results we will obtain. A more general but considerably more technically 
complicated (for non-specialists of random matrix theory) approach is being 
developed in our study of a connected problem and will appear in another 
paper. 

This section is divided into four subsections. The first two are devoted 
to the main technical issues arising in the study of the problem when the 
data is elliptically distributed. The third discusses the impact of correlation 
between observations when the data is Gaussian, as it can be recast as a 
variant of elliptical problems. The last subsection discusses questions related
to the (nonparametric) bootstrap.

4.1. On quadratic forms of the type $v'\hat\Sigma^{-1}v$. The focus of this subsection
is on understanding statistics of the type $v'\hat\Sigma^{-1}v$, where $v$ is a deterministic
vector. We will prove the following important theorem.

Theorem 4.1. Suppose we observe $n$ observations $X_i$, where $X_i$ has the
form $X_i = \mu + \lambda_i\Sigma^{1/2}Y_i$, with $Y_i$ i.i.d. $\mathcal{N}(0, \mathrm{Id}_p)$ and $\{\lambda_i\}_{i=1}^n$ is independent of
$\{Y_i\}_{i=1}^n$. $\Sigma^{1/2}$ is deterministic and $E(\lambda_i^2) = 1$.

We call $\rho_n = p/n$ and assume that $\rho_n \to \rho \in (0,1)$.

We use the notation $\tau_i = \lambda_i^2$ and assume that the empirical distribution,
$G_n$, of $\tau_i$ converges weakly in probability to a deterministic limit $G$. We also
assume that $\tau_i \neq 0$ for all $i$.

If $\tau_{(i)}$ is the $i$th largest $\tau_k$, we assume that we can find a random variable
$N \in \mathbb{N}$ and positive real numbers $\varepsilon_0$ and $C_0$ such that

\[
\begin{cases}
P(p/N < 1 - \varepsilon_0) \to 1 & \text{as } n \to \infty,\\
P(\tau_{(N)} > C_0) \to 1,\\
\exists\, \eta_0 > 0 \text{ such that } P(N/n > \eta_0) \to 1 & \text{as } n \to \infty.
\end{cases}
\tag{Assumption-BB}
\]

Under these assumptions, if $v$ is a (sequence of) deterministic vector,

\[
\frac{v'\hat\Sigma^{-1}v}{v'\Sigma^{-1}v} \to s \quad \text{in probability},
\]

where $s$ satisfies

\[
\int \frac{dG(\tau)}{1 + \rho\tau s} = 1 - \rho.
\tag{4}
\]

A few comments are in order before we turn to the proof. First, the
assumption that $\lambda_i \neq 0$ for all $i$ could be dispensed with, as long as all
assumptions stated above hold when $n$ is understood to denote the number of
nonzero $\lambda_i$'s. Second, (Assumption-BB) concerning $N$ and $C_0$ will generally
hold as soon as $G$ does not put too much mass at 0, the only problem-specific
question remaining being how much mass is put at 0 by $G$ compared to $\rho$,
the limit of $p/n$.

In particular, in the case where the $\tau_i$'s are i.i.d., if there exists $C_0 > 0$
and $x_0 > 0$ such that $P_G(X > C_0) = x_0 > 0$, and if $G_n$ is the empirical
distribution of the $\tau_i$'s, if $G_n \Rightarrow G$, we see, using, for example, Lemma 2.2
in van der Vaart (1998), that

\[
\liminf P_{G_n}(X > C_0) = \liminf \frac{\mathrm{Card}\{\tau_i > C_0\}}{n} \ge P_G(X > C_0) = x_0.
\]

So picking $N = (1 - \delta)x_0 n$ will guarantee that we have, if $G_n \Rightarrow G$ in
probability, $P(\tau_{(N)} > C_0) \to 1$ and, of course, $P(N/n > \eta_0) \to 1$. Hence, in
checking whether the theorem applies, we just need to see whether $p/N$
stays bounded away from 1.

In the simpler case when all the $|\lambda_i|$ are bounded away from 0, the
conditions on $N$ and $C_0$ apply directly by taking $N = n$. Finally, let us say
that (Assumption-BB) is needed in the proof to guarantee that the smallest
eigenvalues of S stay bounded away from 0 with high probability.

We now briefly compare the Gaussian and elliptical cases. A simple
convexity argument [relying on the fact that $1/(1 + x)$ is a convex function of
$x$ for $x > 0$ and Jensen's inequality] shows that, if $\mu_G$ is the mean of $G$,

\[
s \ge \frac{1}{1-\rho}\,\frac{1}{\mu_G}.
\]

In the case of Gaussian data, $G = \delta_1$, that is, it is a point mass at 1 and
we have $s = 1/(1 - \rho)$. In other respects, for $X_i$ to have covariance $\Sigma$, we
need $E(\lambda_i^2) = 1$. When the $\lambda_i$'s are i.i.d., with $\lambda_i^2$ having distribution $G$,
$\mu_G = E(\lambda_i^2) = 1$, and we know that $G_n \Rightarrow G$ in probability. Therefore, in
the class of elliptical distributions considered here, risk underestimation,
which is essentially measured by $1/s$ (see Theorem 2.1 and Section 5) will be
least severe in the Gaussian case. In other words, the Gaussian results lead
to over-optimistic conclusions (in terms of proximity between sample and
population solutions of the quadratic programs we are considering) within
the class of elliptical distributions.
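
To make the comparison concrete, equation (4) can be solved numerically when $G$ is a discrete distribution. The sketch below is an illustration only: the function name `solve_s`, the bisection scheme and the two-point example are assumptions chosen for the example, not part of the paper. It relies on the fact that the left-hand side of (4) is decreasing in $s$ and equals $1 > 1 - \rho$ at $s = 0$.

```python
import numpy as np

def solve_s(taus, weights, rho, tol=1e-12):
    """Solve  sum_j w_j / (1 + rho * tau_j * s) = 1 - rho  for s > 0 by bisection.

    taus, weights: atoms and probabilities of a discrete G (hypothetical inputs).
    Assumes G puts mass less than 1 - rho at 0, so that a finite root exists.
    """
    taus = np.asarray(taus, dtype=float)
    weights = np.asarray(weights, dtype=float)

    def f(s):
        return np.sum(weights / (1.0 + rho * taus * s)) - (1.0 - rho)

    lo, hi = 0.0, 1.0          # f(0) = rho > 0; f is decreasing in s
    while f(hi) > 0:           # expand until the root is bracketed
        hi *= 2.0
        if hi > 1e12:
            raise ValueError("no finite solution: G puts too much mass at 0")
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rho = 0.5
s_gauss = solve_s([1.0], [1.0], rho)             # G = point mass at 1 (Gaussian case)
s_mix = solve_s([0.5, 1.5], [0.5, 0.5], rho)     # two-point G with mean 1
```

In this example the two-point mixture has the same mean as the Gaussian case but a larger $s$, in line with the Jensen bound above.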

We go back to these questions in more detail in Section 5 and now turn 
to the proof of Theorem 4.1. The proof could be carried out in at least 
two ways. We take one that is not standard but we feel best explains the 
phenomenon that is occurring. 

Proof of Theorem 4.1. The proof is easier to carry out when we write
the problem in matrix form. Because we focus on $\hat\Sigma$, we can assume without
loss of generality (wlog) that $\mu = 0$. Let us consider the $n \times p$ data matrix $X$
whose $i$th row is $X_i$. Similarly, we denote by $Y$ the $n \times p$ data matrix whose
$i$th row is $Y_i$. Let us call $\Lambda$ the diagonal matrix with $i$th diagonal entry $\lambda_i$
and $H = \mathrm{Id}_n - ee'/n$, where $e$ is an $n$-dimensional vector whose entries are
all equal to 1. Note that $H'H = H$. With these notations, we have, since we
assume that $\mu = 0$,

\[
X = \Lambda Y \Sigma^{1/2}.
\]

Therefore, $X - \bar X = HX$, and

\[
\hat\Sigma = \frac{1}{n-1}(X - \bar X)'(X - \bar X) = \frac{1}{n-1}\Sigma^{1/2}Y'\Lambda H \Lambda Y\Sigma^{1/2}.
\]

Let us call $L$ the matrix $L = \Lambda H \Lambda$. Note that $Y'LY$ is a rank $p$ matrix with
probability 1, if we assume that $p < n - 1$ (recall that all the entries of $\Lambda$
are nonzero). Hence, $Y'LY$ is invertible with probability 1. Therefore,

\[
\hat\Sigma^{-1} = \Sigma^{-1/2}\Bigl(\frac{1}{n-1}Y'LY\Bigr)^{-1}\Sigma^{-1/2}.
\]

Finally, we have

\[
\frac{v'\hat\Sigma^{-1}v}{v'\Sigma^{-1}v} = \nu'\Bigl(\frac{1}{n-1}Y'LY\Bigr)^{-1}\nu,
\]

where $\nu = \Sigma^{-1/2}v/\|\Sigma^{-1/2}v\|_2$ is a vector of $\ell_2$ norm 1.

We now make all of our statements conditional on $\Lambda$. Because of the
independence of $Y$ and $\Lambda$, we can therefore treat the $\lambda_i$'s as if they were
constant and the $Y_{i,j}$'s as i.i.d. $\mathcal{N}(0,1)$ random variables. $\Lambda$ is now assumed
to be in the set of matrices $\mathcal{L}_{\varepsilon,\delta}$, defined just below, for which we have
control of the smallest eigenvalue of $\mathcal{S} = Y'LY/(n-1)$. In the steps that
follow that are conditional on $\Lambda$, we therefore consider that we control the
smallest eigenvalue of $\mathcal{S}$. We note that if $\Lambda$ is in $\mathcal{L}_{\varepsilon,\delta}$, $N$ is lower bounded.
Because $N$ is a function of the $\lambda_i$'s and hence of $\Lambda$, we write all the results
conditionally on $\Lambda$, but the reader should keep in mind that this conditioning
constrains also the possible values of $N$.
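
The algebraic reduction at the start of the proof, from $v'\hat\Sigma^{-1}v / v'\Sigma^{-1}v$ to a quadratic form in $(Y'LY/(n-1))^{-1}$, is an exact identity and can be verified numerically. The following sketch is illustrative: the helper name `ratio_two_ways`, the dimensions and the choice of a Student-$t$ law for the $\lambda_i$ are assumptions made for the example.

```python
import numpy as np

def ratio_two_ways(Y, lam, Sigma, v):
    """Compute v' Sigma_hat^{-1} v / v' Sigma^{-1} v directly and via nu' S^{-1} nu.

    Hypothetical helper; assumes mu = 0 and X = Lambda Y Sigma^{1/2}.
    """
    n, p = Y.shape
    # Symmetric square root of Sigma and its inverse, via the eigendecomposition.
    evals, evecs = np.linalg.eigh(Sigma)
    Sigma_half = (evecs * np.sqrt(evals)) @ evecs.T
    Sigma_inv_half = (evecs / np.sqrt(evals)) @ evecs.T

    X = (lam[:, None] * Y) @ Sigma_half              # data matrix, rows X_i
    Sigma_hat = np.cov(X, rowvar=False)              # sample covariance, 1/(n-1) scaling
    direct = (v @ np.linalg.solve(Sigma_hat, v)) / (v @ np.linalg.solve(Sigma, v))

    H = np.eye(n) - np.ones((n, n)) / n              # centering matrix
    L = (lam[:, None] * H) * lam[None, :]            # L = Lambda H Lambda
    S = Y.T @ L @ Y / (n - 1)
    nu = Sigma_inv_half @ v
    nu = nu / np.linalg.norm(nu)                     # unit-norm vector nu
    reduced = nu @ np.linalg.solve(S, nu)
    return direct, reduced

rng = np.random.default_rng(1)
n, p = 40, 5
Y = rng.standard_normal((n, p))
lam = rng.standard_t(5, size=n)                      # illustrative nonzero lambda_i
A = rng.standard_normal((p, p))
Sigma = A @ A.T + np.eye(p)                          # illustrative population covariance
v = rng.standard_normal(p)
direct, reduced = ratio_two_ways(Y, lam, Sigma, v)
```

The two numbers agree up to rounding error, which confirms that $\Sigma$ only enters the problem through the unit vector $\nu$.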

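• The set $\mathcal{L}_{\varepsilon,\delta}$.

In Lemma B.1 in the Appendix, we prove the following result: when $\Lambda$ is
such that $p/N < 1 - \varepsilon$, if $\mathfrak{C}_n = C_0\frac{N-1}{n-1}$ [see (Assumption-BB) and Lemma B.1
for definitions] and $\gamma_p$ is the smallest eigenvalue of $Y'LY/(n-1)$, we have, if
$P_\Lambda$ denotes probability conditional on $\Lambda$,

\[
P_\Lambda\bigl(\sqrt{\gamma_p} \le \sqrt{\mathfrak{C}_n}\,[(1 - \sqrt{1-\varepsilon}) - t]\bigr) \le \exp(-(N-1)t^2).
\]

Let us call $\mathcal{L}_{\varepsilon,\delta}$ the set of matrices $\Lambda$ such that $p/N < 1 - \varepsilon$ and
$C_0(N-1)/(n-1) > \delta$. Under (Assumption-BB), for a $\delta$ bounded away from 0 (e.g.,
any fixed $\delta < C_0\eta_0$, since we need a bound on $\liminf C_0N/n$ that holds with
probability going to 1), $P(\Lambda \in \mathcal{L}_{\varepsilon,\delta}) \to 1$. In other respects, if $\Lambda \in \mathcal{L}_{\varepsilon,\delta}$,

\[
P_\Lambda\bigl(\sqrt{\gamma_p} \le \sqrt{\delta}\,[(1 - \sqrt{1-\varepsilon}) - t]\bigr) \le \exp(-(n-1)\delta t^2/C_0).
\]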

• Getting results conditionally on $\Lambda$.

If $O$ is an orthogonal matrix, $O'Y'LYO \overset{\mathcal{L}}{=} Y'LY$, because $Y$ is full of i.i.d.
$\mathcal{N}(0,1)$ random variables and is therefore invariant (in law) by left and right
rotation. Therefore the eigenvalues and eigenvectors of $Y'LY$ are independent
and its matrix of eigenvectors is uniformly (i.e., Haar) distributed on
the orthogonal group [see also Chikuse (2003), page 40, equation (2.4.4)].
Let us write a spectral decomposition of $Y'LY$

\[
\mathcal{S} = \frac{1}{n-1}Y'LY = \sum_{i=1}^{p}\gamma_i v_i v_i'.
\]

We know that a.s. $\gamma_i \neq 0$ for all $i$, so

\[
\mathcal{S}^{-1} = \sum_{i=1}^{p}\frac{1}{\gamma_i}v_i v_i'.
\]

We claim that

\[
\nu'\mathcal{S}^{-1}\nu - \frac{1}{p}\sum_{i=1}^{p}\frac{1}{\gamma_i}\Bigm|\bigl(\{\gamma_i\}_{i=1}^{p}, \Lambda\bigr) \to 0.
\]

To see this, note that $E((\nu'v_i)^2) = \|\nu\|_2^2/p = 1/p$ because $v_i$ is uniformly
distributed on the unit sphere when $\Gamma$ (the matrix containing the $v_i$) is
Haar distributed on the orthogonal group. Hence, given the independence
between $\gamma_i$ and $v_i$,

\[
E\bigl(\nu'\mathcal{S}^{-1}\nu \mid \{\gamma_i\}_{i=1}^{p}, \Lambda\bigr) = \frac{1}{p}\sum_{i=1}^{p}\frac{1}{\gamma_i}.
\]

Now let us call $w$ the vector with $w_i = (\nu'v_i)^2$, and $g$ the vector with $i$th entry
$g_i = 1/\gamma_i$. Clearly, since $\nu'\mathcal{S}^{-1}\nu = g'w$, $\mathrm{var}(\nu'\mathcal{S}^{-1}\nu \mid \{\gamma_i\}, \Lambda) = g'\,\mathrm{cov}(w)\,g$.
By symmetry it is clear that $\mathrm{cov}(w)(i,i) = \mathrm{cov}(w)(1,1)$ and $\mathrm{cov}(w)(i,j) = \mathrm{cov}(w)(1,2)$
if $i \neq j$. Further, since the matrix $\Gamma$ containing the vectors $v_i$
is Haar distributed on the orthogonal group, we can assume without loss of
generality that $\nu = e_1$ for all the computations at stake. As a matter of fact,
if $O_1$ is an orthogonal matrix such that $O_1\nu = e_1$, then $\nu'v_i = e_1'O_1v_i = e_1'\tilde v_i$
where the matrix $\tilde\Gamma = O_1\Gamma$ is again Haar distributed on the orthogonal
group.

So from now on, we assume (without loss of generality) that $\nu = e_1$, and
we therefore simply need to understand the correlation between $(v_1(1))^2$ and
$(v_2(1))^2$. Now, the first row of an orthogonal matrix uniformly distributed
on the orthogonal group is a unit vector uniformly distributed on the unit
sphere, because if $O$ is Haar distributed, so is $O'$. We now recall the fact
that a vector $v$ uniformly distributed on the unit sphere can be generated
by drawing at random a $\mathcal{N}(0, \mathrm{Id}_p)$ random vector and normalizing it. In
other words, if $Z \sim \mathcal{N}(0, \mathrm{Id}_p)$, $v = Z/\|Z\|_2$.

So our task has now been considerably simplified, and it consists in
understanding the covariance between 2 random variables, $r_1$ and $r_2$ such that,
if $Z_i$ are i.i.d. $\mathcal{N}(0,1)$,

\[
r_i = \frac{Z_i^2}{\sum_{j=1}^{p}Z_j^2}.
\]

Now, by symmetry, $E(r_1r_2) = E(r_ir_j)$ for all $i \neq j$ and $p(p-1)E(r_1r_2) =
\sum_{i\neq j}E(r_ir_j)$. In other words, since $\sum_{i=1}^{p}r_i = 1$,

\[
p(p-1)E(r_1r_2) = E\Bigl(\Bigl(\sum_{i=1}^{p}r_i\Bigr)^2 - \sum_{i=1}^{p}r_i^2\Bigr).
\]

We can therefore conclude that

\[
p(p-1)E(r_1r_2) = 1 - pE\Bigl(\frac{Z_1^4}{(\sum_{i=1}^{p}Z_i^2)^2}\Bigr).
\]

Hence, $E(r_1r_2) \le 1/(p(p-1))$. On the other hand,

\[
E\Bigl(\frac{Z_1^4}{(\sum_{i=1}^{p}Z_i^2)^2}\Bigr) \le E\Bigl(\frac{Z_1^4}{(\sum_{i=2}^{p}Z_i^2)^2}\Bigr) = \frac{3}{(p-3)(p-5)},
\]

since $\sum_{i=2}^{p}Z_i^2 \sim \chi^2_{p-1}$, and $E((\chi^2_{p-1})^r) = 2^r\,\Gamma((p-1)/2 + r)/\Gamma((p-1)/2)$,
for $r > -(p-1)/2$ [see, e.g., Mardia, Kent and Bibby (1979), page 487].
Applying these results with $r = -2$ yields the above result as soon as $p > 5$,
by using the fact that $\Gamma(x+1) = x\Gamma(x)$. We therefore have

\[
E(r_1r_2) \ge \frac{1}{p(p-1)} - \frac{3p}{p(p-1)(p-3)(p-5)}.
\]

Since, for instance by symmetry, $E(r_1) = 1/p$, and $1/(p(p-1)) - 1/p^2 =
(p^2(p-1))^{-1}$, we conclude that

\[
\frac{1}{p^2(p-1)} - \frac{3p}{p(p-1)(p-3)(p-5)} \le \mathrm{cov}(r_1, r_2) \le \frac{1}{p^2(p-1)}.
\]

We have therefore established the fact that

\[
|\mathrm{cov}(r_1, r_2)| = O(p^{-3}).
\]

On the other hand, since $E(r_1^2) = E(Z_1^4(\sum_{i=1}^{p}Z_i^2)^{-2})$, we have

\[
0 \le \mathrm{var}(r_1) \le \frac{3}{(p-3)(p-5)} - \frac{1}{p^2}.
\]

Now using the (standard) fact that, for symmetric matrices $M$, if $\sigma_1(M)$ is
the largest singular value of $M$,

\[
\sigma_1(M) \le \max_i \sum_j |m_{i,j}|,
\]

[it can easily be proved using, for instance, Theorems 5.6.6 and 5.6.9 in
Horn and Johnson (1994), or Gersgorin's theorem (Theorem 6.1.1 in the
same reference)] we have

\[
|||\mathrm{cov}(r)|||_2 \le \Bigl(\frac{3}{(p-3)(p-5)} - \frac{1}{p^2}\Bigr) + O(p^{-2}) = O(p^{-2}).
\]

The first term in the previous bound comes from the contribution of the
diagonal and the second term is the sum over the $p-1$ off-diagonal elements
on a given row of the upper bound we had on each such element, that is,
$Cp^{-3}$ for some $C$.
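
As a sanity check on these orders of magnitude, one can compare the bounds with exact values. The sketch below is not part of the original argument: it uses the standard fact that the squared coordinates of a uniform vector on the unit sphere of $\mathbb{R}^p$ follow a Dirichlet$(1/2, \ldots, 1/2)$ distribution, which gives $E(r_1^2) = 3/(p(p+2))$ and $E(r_1r_2) = 1/(p(p+2))$. The function names and the Monte Carlo settings are illustrative assumptions.

```python
from fractions import Fraction
import numpy as np

def exact_moments(p):
    """Exact var(r_1) and cov(r_1, r_2) from the Dirichlet(1/2, ..., 1/2) moments."""
    mean = Fraction(1, p)
    var = Fraction(3, p * (p + 2)) - mean ** 2      # equals 2(p-1) / (p^2 (p+2))
    cov = Fraction(1, p * (p + 2)) - mean ** 2      # equals -2 / (p^2 (p+2))
    return var, cov

def monte_carlo_cov(p, n_draws=200_000, seed=0):
    """Simulated cov(r_1, r_2), with r_i = Z_i^2 / sum_j Z_j^2 and Z Gaussian."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n_draws, p))
    r = Z ** 2 / np.sum(Z ** 2, axis=1, keepdims=True)
    return float(np.cov(r[:, 0], r[:, 1])[0, 1])

var_10, cov_10 = exact_moments(10)
cov_10_mc = monte_carlo_cov(10)
```

The exact covariance is of order $p^{-3}$ and the exact variance of order $p^{-2}$, consistent with (and sharper than) the bounds derived in the proof.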

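Let us now return to our initial question which was to show that the
conditional variance of interest to us was going to zero. Recall that $g$ is a
vector whose $i$th entry is $1/\gamma_i$. Since

\[
\mathrm{var}(\nu'\mathcal{S}^{-1}\nu \mid \{\gamma_i\}, \Lambda) = g'\,\mathrm{cov}(w)\,g,
\]

and $\mathrm{cov}(w) = \mathrm{cov}(r)$, we have, for $C$ a constant, and if $|||A|||_2$ denotes the
operator norm (or largest singular value) of the matrix $A$,

\[
\mathrm{var}(\nu'\mathcal{S}^{-1}\nu \mid \{\gamma_i\}, \Lambda) \le |||\mathrm{cov}(r)|||_2\,\|g\|_2^2 \le C\frac{\|g\|_2^2}{p^2} = \frac{C}{p^2}\sum_{i=1}^{p}\frac{1}{\gamma_i^2}.
\]

Now given the assumptions we made on $\Lambda$, according to the arguments
given at the beginning of this proof and Lemma B.1 in the Appendix,
$\gamma_i \ge \mathfrak{C}_n(1 - \sqrt{p/(N-1)})^2/2$, where $\mathfrak{C}_n = C_0(N-1)/(n-1)$, with high
($\{Y_i\}_{i=1}^n$)-probability. So we conclude that all the $\gamma_i$'s are bounded away [uniformly
for $\Lambda$ in $\mathcal{L}_{\varepsilon,\delta}$ and with high ($\{Y_i\}_{i=1}^n$)-probability] from 0, and when this is
the case,

\[
\mathrm{var}(\nu'\mathcal{S}^{-1}\nu \mid \{\gamma_i\}, \Lambda) \to 0.
\]

Therefore,

\[
\nu'\mathcal{S}^{-1}\nu - \frac{1}{p}\sum_{i=1}^{p}\frac{1}{\gamma_i}\Bigm|\{\gamma_i\}_{i=1}^{p}, \Lambda \to 0 \quad \text{in probability}.
\]

Let us now show that this implies convergence in probability to 0
(conditional on $\Lambda$ only) of $Q_n = \nu'\mathcal{S}^{-1}\nu - \frac{1}{p}\sum_{i=1}^{p}\frac{1}{\gamma_i}$. Let us call
$h_n = C\frac{\|g\|_2^2}{p^2} = \frac{C}{p^2}\sum_{i=1}^{p}\frac{1}{\gamma_i^2}$. For $C_n$ to be determined later, we have

\[
P(|Q_n| > \varepsilon \mid \Lambda) \le P(|Q_n| > \varepsilon \ \&\ h_n \le C_n \mid \Lambda) + P(h_n > C_n \mid \Lambda).
\]

On the other hand,

\[
P(|Q_n| > \varepsilon \ \&\ h_n \le C_n \mid \Lambda) = E\bigl(E(1_{|Q_n|>\varepsilon}1_{h_n \le C_n} \mid \{\gamma_i\}, \Lambda) \mid \Lambda\bigr).
\]

Because $h_n$ is a function of the $\gamma_i$'s and $\mathrm{var}(Q_n \mid \{\gamma_i\}_{i=1}^{p}, \Lambda) \le h_n$,

\[
E(1_{|Q_n|>\varepsilon}1_{h_n \le C_n} \mid \{\gamma_i\}, \Lambda) = 1_{h_n \le C_n}E(1_{|Q_n|>\varepsilon} \mid \{\gamma_i\}, \Lambda) \le 1_{h_n \le C_n}\frac{h_n}{\varepsilon^2} \le \frac{C_n}{\varepsilon^2}.
\]

But when $\Lambda \in \mathcal{L}_{\varepsilon,\delta}$, under our assumptions and their consequences on the
$\gamma_i$'s mentioned above [i.e., $\gamma_i \ge \mathfrak{C}_n(1 - \sqrt{p/(N-1)})^2/2$ with high $\{Y_i\}_{i=1}^n$
probability], we have $h_n \mid \Lambda = O_P(1/p)$, so taking $C_n = n^{-1/2}$, we have
$P(h_n > C_n \mid \Lambda) \to 0$ and of course, $C_n/\varepsilon^2 \to 0$. Hence, for any $\varepsilon > 0$,

\[
P(|Q_n| > \varepsilon \mid \Lambda) \to 0.
\]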

Let us now turn to the question of identifying the limit.

• About $\frac{1}{p}\sum_{i=1}^{p}\frac{1}{\gamma_i}$.

The Stieltjes transform of the spectral distribution of $Y'LY/(n-1)$ is

\[
s_p(z) = \frac{1}{p}\sum_{i=1}^{p}\frac{1}{\gamma_i - z}.
\]

The quantity $\frac{1}{p}\sum_{i=1}^{p}\frac{1}{\gamma_i}$ is therefore $s_p(0)$ and we are interested in its limit,
if it exists, which would correspond to $s$.

Recall the Marcenko-Pastur equation, from Marcenko and Pastur (1967),
Wachter (1978) and Silverstein (1995): if $Y$ is $n \times p$ and has i.i.d. entries with
mean 0 and variance 1 and $L$ is positive semidefinite, has limiting spectral
distribution $G$ and is independent of $Y$, if $p/n \to \rho > 0$, and if $m_p$ is the
Stieltjes transform of the spectral distribution of $Y'LY/p$, then $m_p(z)$ tends
(in probability) to $m(z)$ for all $z$ in $\mathbb{C}^+$ and $m$ satisfies

\[
-\frac{1}{m(z)} = z - \frac{1}{\rho}\int\frac{\tau\,dG(\tau)}{1 + \tau m(z)}.
\tag{5}
\]

Note that, if $p/n = \rho_n$, we have

\[
\rho_n s_p(\rho_n z) = m_p(z).
\]

Therefore, according to Marcenko and Pastur (1967), Wachter (1978) and
Silverstein (1995), we know that $s_p(z)$ converges for $z \in \mathbb{C}^+$ to a nonrandom
quantity $s(z)$, in probability. Note that $s$ satisfies, in light of equation (5),

\[
-\frac{1}{s(z)} = z - \int\frac{\tau\,dG(\tau)}{1 + \tau\rho s(z)}.
\]

Here, because we know using our assumptions (see the end of the proof)
that the $\gamma_i$ are bounded away from 0 with probability going to 1, we can also
conclude that $s_p(0) \to s(0)$ with probability going to 1, because of the weak
convergence (in probability) of spectral distributions that pointwise
convergence of Stieltjes transforms implies (as a test function, we can use a
function that coincides with $1/x$ except in an interval near 0 where we are
guaranteed that there are no eigenvalues asymptotically). We also know that
$s$ is continuous (and actually analytic) at 0 in this situation since $s$ is
the Stieltjes transform of a measure whose support is bounded away from 0.
So the previous equation holds for $z = 0$, and we have

\[
\frac{1}{s(0)} = \int\frac{\tau\,dG(\tau)}{1 + \tau\rho s(0)}.
\]

Multiplying both sides by $-\rho s(0)$, we get, after we recall that $G$ is a
probability measure,

\[
-\rho = -\int\frac{\tau\rho s(0)\,dG(\tau)}{1 + \tau\rho s(0)} = -\int\Bigl(1 - \frac{1}{1 + \tau\rho s(0)}\Bigr)dG(\tau) = -1 + \int\frac{dG(\tau)}{1 + \tau\rho s(0)}.
\]

Calling $s(0) = s$, we have the result we announced, conditionally on $\Lambda$. Now,
here $G$ is the limiting spectral distribution of $\Lambda H\Lambda$, but because this matrix
is a rank one perturbation of $\Lambda^2$, these two matrices have the same limiting
spectral distribution. This concludes this part of the proof.

• Getting results unconditionally on $\Lambda$.

All the statements above were made conditional on $\Lambda$. If we can show that
our probability bounds and our characterization of the limit hold uniformly
in $\Lambda$, we will have an unconditional statement, as we seek.

The fact that the limit does not depend on $\Lambda$ is essentially obvious from its
description: all that matters is the limiting spectral distribution, which is the
same for all $\Lambda$. Let us consider the question of uniform probability bounds.
All we need to do is show that we control $P(h_n > C_n \mid \Lambda)$ uniformly in $\Lambda$. At
this point, it is helpful to recall that $N$ can be viewed as a function of $\Lambda$.
Recall also that if $\Lambda \in \mathcal{L}_{\varepsilon,\delta}$,

\[
P_\Lambda\bigl(\sqrt{\gamma_p} \le \sqrt{\delta}\,[(1 - \sqrt{1-\varepsilon}) - t]\bigr) \le \exp(-(n-1)\delta t^2/C_0).
\]

Hence, when $\Lambda \in \mathcal{L}_{\varepsilon,\delta}$, if $C_n = n^{-1/2}$, $P(h_n > C_n \mid \Lambda) \le f_n(C_0, \varepsilon, \delta)$, where
$f_n(C_0, \varepsilon, \delta)$ tends to 0 as $n$ tends to infinity. In other words, we have now
established that if $\Lambda \in \mathcal{L}_{\varepsilon,\delta}$ and $Q_n = \nu'\mathcal{S}^{-1}\nu - \frac{1}{p}\sum_{i=1}^{p}\frac{1}{\gamma_i}$, for any $t > 0$,

\[
P(|Q_n| > t \mid \Lambda) \le \frac{C_n}{t^2} + f_n(C_0, \varepsilon, \delta).
\]

Using the fact that $P(|Q_n| > t) \le P(|Q_n| > t \ \&\ \{\Lambda \in \mathcal{L}_{\varepsilon,\delta}\}) + P(\Lambda \notin \mathcal{L}_{\varepsilon,\delta})$,
we conclude that $P(|Q_n| > t) \to 0$ as $n$ tends to infinity for any $t > 0$ and
the proof is complete. □

As a consequence of Theorem 4.1, we have the following practically useful 
result. 

Lemma 4.2. We assume that the assumptions of Theorem 4.1 hold and
that $G$ is such that $s$ is not $\infty$.

Suppose that $v_1$ and $v_2$ are deterministic vectors such that

\[
\frac{v_1'\Sigma^{-1}v_2}{(v_1+v_2)'\Sigma^{-1}(v_1+v_2)} \quad\text{and}\quad \frac{v_1'\Sigma^{-1}v_2}{(v_1-v_2)'\Sigma^{-1}(v_1-v_2)}
\]

are bounded away from 0. Then under the assumptions of Theorem 4.1,

\[
\frac{v_1'\hat\Sigma^{-1}v_2}{v_1'\Sigma^{-1}v_2} \to s \quad \text{in probability}.
\]

In other respects, suppose that $v_1'\Sigma^{-1}v_2 \to 0$, while $v_1'\Sigma^{-1}v_1$ and $v_2'\Sigma^{-1}v_2$
stay bounded away from $\infty$. Then, under the assumptions of Theorem 4.1,

\[
v_1'\hat\Sigma^{-1}v_2 \to 0 \quad \text{in probability}.
\]

Proof. The proof of the first part of the lemma is an immediate
consequence of Theorem 4.1, after writing

\[
4\,\frac{v_1'\hat\Sigma^{-1}v_2}{v_1'\Sigma^{-1}v_2}
= \frac{(v_1+v_2)'\hat\Sigma^{-1}(v_1+v_2)}{(v_1+v_2)'\Sigma^{-1}(v_1+v_2)}\,\frac{(v_1+v_2)'\Sigma^{-1}(v_1+v_2)}{v_1'\Sigma^{-1}v_2}
- \frac{(v_1-v_2)'\hat\Sigma^{-1}(v_1-v_2)}{(v_1-v_2)'\Sigma^{-1}(v_1-v_2)}\,\frac{(v_1-v_2)'\Sigma^{-1}(v_1-v_2)}{v_1'\Sigma^{-1}v_2}.
\]

For the proof of the second part, we note that Theorem 4.1 implies that

\[
v'\hat\Sigma^{-1}v = s\,v'\Sigma^{-1}v + o_P(v'\Sigma^{-1}v).
\]

Note that since for $i = 1, 2$, $v_i'\Sigma^{-1}v_i$ is assumed to stay bounded, the same
is true of $(v_1 + \epsilon v_2)'\Sigma^{-1}(v_1 + \epsilon v_2)$, where $\epsilon = \pm 1$. Now we write

\[
4v_1'\hat\Sigma^{-1}v_2 = (v_1+v_2)'\hat\Sigma^{-1}(v_1+v_2) - (v_1-v_2)'\hat\Sigma^{-1}(v_1-v_2).
\]

Our previous remark and the assumption of boundedness of $v_i'\Sigma^{-1}v_i$ imply
that, when $v_1'\Sigma^{-1}v_2 \to 0$,

\[
4v_1'\hat\Sigma^{-1}v_2 = s\bigl((v_1+v_2)'\Sigma^{-1}(v_1+v_2) - (v_1-v_2)'\Sigma^{-1}(v_1-v_2)\bigr) + o_P(1)
= 4s\,v_1'\Sigma^{-1}v_2 + o_P(1) = o_P(1). \quad □
\]

4.2. On quadratic forms involving $\hat\mu$ and $\hat\Sigma^{-1}$. As is clear from the
solutions of problems (QP-eqc) and (QP-eqc-Emp), when $\hat\mu$ appears in the
matrix $\hat V$, its influence on the solution of our quadratic program will
manifest itself in the form of quantities of the type $\hat\mu'\hat\Sigma^{-1}\hat\mu$ and $v'\hat\Sigma^{-1}\hat\mu$. It is
therefore important that we get a good understanding of those quantities.

Compared to the Gaussian case, in the elliptical case, $\hat\mu$ is not independent
of $\hat\Sigma$ anymore, which generates some complications. They are fully addressed
in Theorem 4.6, but as a stepping stone to that result (the main one of this
subsection), we need the following theorem, which essentially takes care of
the problem of understanding $\hat\mu'\hat\Sigma^{-1}\hat\mu$ for the class of elliptical distributions
we consider when the population mean is 0.

Theorem 4.3. Suppose $Y$ is an $n \times p$ matrix whose rows are the vectors
$Y_i$, which are i.i.d. $\mathcal{N}(0, \mathrm{Id}_p)$.

Suppose $\Lambda$ is a diagonal matrix whose $i$th entry is $\lambda_i$, which is possibly
random and is independent of $Y$. Call $\tau_i = \lambda_i^2$. We assume that $\tau_i \neq 0$ for
all $i$ and

\[
\frac{1}{n^2}\sum_{i=1}^{n}\lambda_i^4 = \frac{1}{n^2}\sum_{i=1}^{n}\tau_i^2 \to 0 \quad \text{in probability}.
\tag{Assumption-BLa}
\]

If $\tau_{(i)}$ is the $i$th largest $\tau_k$, we assume that we can find a random variable
$N \in \mathbb{N}$ and positive real numbers $\varepsilon_0$ and $C_0$ such that

\[
\begin{cases}
P(p/N < 1 - \varepsilon_0) \to 1 & \text{as } n \to \infty,\\
P(\tau_{(N)} > C_0) \to 1,\\
\exists\, \eta_0 > 0 \text{ such that } P(N/n > \eta_0) \to 1 & \text{as } n \to \infty.
\end{cases}
\]

Let us call $\rho_n = p/n$ and $\rho = \lim_{n\to\infty}\rho_n$. We assume that $\rho \in (0,1)$. We
call

\[
Z_{n,p} = \frac{1}{n^2}\,e'\Lambda Y(Y'\Lambda^2Y/n)^{-1}Y'\Lambda e.
\]

Then we have

\[
Z_{n,p} \to \rho \quad \text{in probability}.
\]

If the $n \times p$ data matrix $X$ is written $X = \Lambda Y\Sigma^{1/2}$, and if $\hat m = \Sigma^{1/2}Y'\Lambda e/n$
is the vector of column means of $X$, and if $\hat\Sigma$ is the sample covariance matrix
computed from $X$, we have

\[
\hat m'\hat\Sigma^{-1}\hat m \to \kappa = \frac{\rho}{1-\rho} \quad \text{in probability}.
\]



Some comments on this theorem are in order. First, $Z_{n,p}$ is unchanged
if we rescale all the $\lambda_i$'s by the same constant. So it appears we could
assume that they are all less than 1, for instance, and dispense entirely with
(Assumption-BLa). However, that would potentially violate the conditions
of (Assumption-BB) which appear to guarantee that $Z_{n,p}$ has variance going
to zero. We also note that because the $Y_i$'s have a continuous distribution
and we know that all the $\lambda_i$'s are different from 0, the existence of
$Z_{n,p}$ is guaranteed with probability 1.

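Some practical clarifications are also in order concerning the condition

$$\frac{1}{n^2}\sum_{i=1}^n \lambda_i^4 = \frac{1}{n^2}\sum_{i=1}^n \tau_i^2 \to 0 \quad \text{in probability.}$$
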
When the $\lambda_i$'s are i.i.d., this condition is satisfied (almost surely
and hence in probability) if, for instance, the $\lambda_i$'s have finite second
moment, according to the Marcinkiewicz-Zygmund law of large numbers [see Chow
and Teicher (1997), page 125]. This is very interesting from a practical
standpoint as it basically means that we only require our random variables
$X_i$ to have a second moment for the theorem to hold. We note that if there
were no variance, the premises of the problem would be essentially flawed (after
all, the quadratic form we are optimizing involves a proxy for the population
covariance, and, in the absence of a second moment for the $\lambda_i$'s, the
population covariance would not exist), and hence we require minimal conditions
from the point of view of the practical problem at stake.

Finally, and remarkably, the limit of $Z_{n,p}$ does not depend on the empirical
distribution of the $\lambda_i$'s. In particular, in the class of elliptical
distributions (satisfying the assumptions of Theorem 4.3), the limit of
$\hat m'\hat\Sigma^{-1}\hat m$ is always the same: $\kappa = \rho/(1-\rho)$.
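
The insensitivity of the limit to the law of the $\lambda_i$'s can be
illustrated numerically. The following is only a minimal Monte Carlo sketch, not
part of the original argument: it assumes $\Sigma = \mathrm{Id}_p$, and the
three laws chosen for the $\lambda_i$'s, the sizes `n`, `p`, and the function
name `theorem_4_3_statistics` are illustrative assumptions.

```python
import numpy as np

def theorem_4_3_statistics(n, p, lam, rng):
    """Return (Z_np, m' Sigma_hat^{-1} m) for X = Lambda Y, assuming Sigma = Id_p."""
    Y = rng.standard_normal((n, p))       # rows Y_i ~ N(0, Id_p)
    X = lam[:, None] * Y                  # X = Lambda Y
    m = X.mean(axis=0)                    # column means, m = Y' Lambda e / n
    S = X.T @ X / n                       # S = Y' Lambda^2 Y / n
    Z = m @ np.linalg.solve(S, m)         # Z_{n,p}
    Sigma_hat = np.cov(X, rowvar=False)   # sample covariance, 1/(n-1) scaling
    quad = m @ np.linalg.solve(Sigma_hat, m)
    return Z, quad

rng = np.random.default_rng(0)
n, p = 1000, 250
rho = p / n
laws = {
    "lambda_i = 1 (Gaussian case)": np.ones(n),
    "shifted exponential": 0.5 + rng.exponential(size=n),
    "two-point mixture": rng.choice([0.5, 3.0], size=n),
}
for name, lam in laws.items():
    Z, quad = theorem_4_3_statistics(n, p, lam, rng)
    print(f"{name:30s} Z = {Z:.3f} (rho = {rho:.3f}), "
          f"quad = {quad:.3f} (kappa = {rho / (1 - rho):.3f})")
```

Under these assumptions, the printed values of `Z` and `quad` should be close to
$\rho$ and $\kappa$ respectively for each law, up to Monte Carlo fluctuations.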

We now turn to proving Theorem 4.3. The proof will be facilitated by the
following lemma, which essentially gives us $E(Z_{n,p})$.

Lemma 4.4. Let $Y$ be an $n \times p$ random matrix, with $n > p$, with, for
instance, independent rows $Y_i$. Assume that the $Y_i$ have symmetric
distributions, that is, $Y_i \overset{\mathcal{L}}{=} -Y_i$. Let $\Lambda$ be an
$n \times n$ diagonal matrix with possibly random entries. Let
$P = \Lambda Y (Y'\Lambda^2 Y)^{-1} Y'\Lambda$ be a random projection matrix.
$Y$ is assumed to be independent of $\Lambda$, and $Y$ and $\Lambda$ are assumed
to be such that $P$ exists with probability 1. Then,

$$E(e'Pe \mid \Lambda) = E(e'Pe) = p.$$



In particular, the result applies when the $Y_i$ are normally distributed,
$\Lambda$ is such that (Assumption-BB) holds, and $P$ is defined with
probability one.

Proof of Lemma 4.4. Let us note that $P = f_\Lambda(Y_1, \ldots, Y_n)$. Now,
conditional on $\Lambda$, $\tilde P = f_\Lambda(-Y_1, Y_2, \ldots, Y_n)
\overset{\mathcal{L}}{=} P$. However, $\tilde P(1,j) = -P(1,j)$ if $j \neq 1$.
As a matter of fact,

$$P(1,j) = \lambda_1 \lambda_j\, Y_1' \Bigl(\sum_{i=1}^n \lambda_i^2 Y_i Y_i'\Bigr)^{-1} Y_j.$$

Hence, conditional on $\Lambda$, $P(1,j) \overset{\mathcal{L}}{=} -P(1,j)$. Now
$P$ is an orthogonal projection matrix, $P = P'$, so all its entries are bounded
in absolute value by 1, the operator norm of $P$. In particular, all the entries
have an expectation. Since, if $j \neq 1$, $P(1,j)$ has a symmetric distribution
(conditional on $\Lambda$), we conclude that

$$E(P(1,j) \mid \Lambda) = 0, \quad \text{if } j \neq 1.$$

Note that the same arguments would apply if 1 were replaced by $i$, so we
really have

$$E(P(i,j) \mid \Lambda) = 0, \quad \text{if } j \neq i.$$

Therefore,

$$E(e'Pe \mid \Lambda) = E(\mathrm{trace}(P) \mid \Lambda) = p,$$

since $P$ has rank $p$ and is a projection matrix.

The same results hold when we take expectations over $\Lambda$ by similar
arguments. □
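
The two ingredients of this proof can be checked numerically. The sketch below
is illustrative only: the helper name `projection`, the sizes, and the shifted
exponential law for the $\lambda_i$'s are assumptions. It verifies the sign-flip
symmetry of the first row of $P$ exactly, and estimates $E(e'Pe \mid \Lambda)$
by simulation.

```python
import numpy as np

def projection(Y, lam):
    """P = Lambda Y (Y' Lambda^2 Y)^{-1} Y' Lambda for Lambda = diag(lam)."""
    W = lam[:, None] * Y
    return W @ np.linalg.solve(W.T @ W, W.T)

rng = np.random.default_rng(0)
n, p = 60, 20
lam = 0.5 + rng.exponential(size=n)   # illustrative nonzero lambda_i's
e = np.ones(n)

Y = rng.standard_normal((n, p))
P = projection(Y, lam)

# Replacing Y_1 by -Y_1 negates P(1, j) for j != 1 and leaves P(1, 1) unchanged.
Y_flipped = Y.copy()
Y_flipped[0] *= -1.0
P_flipped = projection(Y_flipped, lam)
print(np.allclose(P_flipped[0, 1:], -P[0, 1:]),
      np.isclose(P_flipped[0, 0], P[0, 0]))

# trace(P) = p holds exactly; E(e'Pe | Lambda) = p is estimated by Monte Carlo.
draws = [e @ projection(rng.standard_normal((n, p)), lam) @ e
         for _ in range(2000)]
print(np.trace(P), np.mean(draws))
```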

To prove Theorem 4.3, all we have to do (in light of Lemma 4.4) is to
show that we control the variance of

$$Z_{n,p} = \frac{1}{n}\, e'Pe.$$

We are going to do this now by using rank 1 perturbation arguments, in
connection with the Efron-Stein inequality.

Proof of Theorem 4.3. As before, we first work conditionally on $\Lambda$.
We assume until further notice that $\Lambda \in \mathcal{L}_{\epsilon_0,\delta_0}$,
a set of matrices which is defined at the end of the proof, will have measure
going to 1 asymptotically, and is such that all the technical issues appearing
in the proof can be taken care of. (The arguments are not circular.)

We will use the notation

$$S = \frac{1}{n}\sum_{k=1}^n \lambda_k^2 Y_k Y_k' \quad \text{and} \quad S_i = S - \frac{1}{n}\lambda_i^2 Y_i Y_i'.$$

Note that $S_i$ is symmetric and positive semi-definite. Naturally, in matrix
form we can write $S = (Y'\Lambda^2 Y)/n$ and $S_i = (Y'\Lambda_i^2 Y)/n$, where
$\Lambda_i$ is the same matrix as $\Lambda$, except that $\Lambda_i(i,i) = 0$.
Our aim is to approximate

$$Z_{n,p} = \frac{1}{n^2}\, e'\Lambda Y \Bigl(\frac{Y'\Lambda^2 Y}{n}\Bigr)^{-1} Y'\Lambda e = f(Y_1, \ldots, Y_n)$$

by a random variable involving only $(Y_1, \ldots, Y_{i-1}, Y_{i+1}, \ldots, Y_n)$,
that is, not involving $Y_i$. Using classic matrix perturbation results [see
Horn and Johnson (1990), page 19], we have

$$S^{-1} = \Bigl(S_i + \frac{\lambda_i^2}{n} Y_i Y_i'\Bigr)^{-1} = S_i^{-1} - \frac{\lambda_i^2}{n}\, \frac{S_i^{-1} Y_i Y_i' S_i^{-1}}{1 + \lambda_i^2 Y_i' S_i^{-1} Y_i/n}.$$

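Of course, if $e_i$ is the $i$th canonical basis vector in $\mathbb{R}^n$,

$$W = \Lambda Y = \sum_{i=1}^n \lambda_i e_i Y_i' = W_i + \lambda_i e_i Y_i'.$$

Let us now call $q_i = Y_i' S_i^{-1} Y_i/n$ and $r_i = W_i S_i^{-1} Y_i$. We have

(6)
$$\Lambda Y S^{-1} = W_i S_i^{-1} - \frac{\lambda_i^2}{n}\, \frac{r_i Y_i' S_i^{-1}}{1 + \lambda_i^2 q_i} + \lambda_i e_i Y_i' S_i^{-1} - \lambda_i^3\, \frac{q_i\, e_i Y_i' S_i^{-1}}{1 + \lambda_i^2 q_i}.$$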



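Similarly,

$$\begin{aligned}
\Lambda Y S^{-1} Y'\Lambda ={}& W_i S_i^{-1} W_i' - \frac{\lambda_i^2}{n}\, \frac{r_i r_i'}{1 + \lambda_i^2 q_i} + \lambda_i e_i r_i' - \lambda_i^3\, \frac{q_i\, e_i r_i'}{1 + \lambda_i^2 q_i} \\
&+ \lambda_i r_i e_i' - \lambda_i^3\, \frac{q_i\, r_i e_i'}{1 + \lambda_i^2 q_i} + \lambda_i^2 n q_i\, e_i e_i' - \lambda_i^4\, \frac{n q_i^2\, e_i e_i'}{1 + \lambda_i^2 q_i}.
\end{aligned}$$

This is, in some sense, the key expansion in this proof. Now let us call
$\hat\mu_i = W_i'e/n$ and $w_i = e'r_i/n = \hat\mu_i' S_i^{-1} Y_i$. We have

$$Z_{n,p} = \frac{1}{n^2}\, e'W_i S_i^{-1} W_i' e - \frac{\lambda_i^2}{n}\, \frac{w_i^2}{1 + \lambda_i^2 q_i} + \frac{2\lambda_i w_i}{n} - \frac{2\lambda_i^3}{n}\, \frac{q_i w_i}{1 + \lambda_i^2 q_i} + \frac{\lambda_i^2 q_i}{n} - \frac{\lambda_i^4}{n}\, \frac{q_i^2}{1 + \lambda_i^2 q_i}.$$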



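Now let us call $Z_i = \hat\mu_i' S_i^{-1} \hat\mu_i$. Clearly, $Z_i$ does not
depend on $Y_i$. Now, it is easily verified that

$$Z_{n,p} = Z_i + \frac{1}{n}\, \frac{2\lambda_i w_i - \lambda_i^2 w_i^2 + \lambda_i^2 q_i}{1 + \lambda_i^2 q_i}.$$

We finally conclude that

(8)
$$Z_{n,p} = Z_i + \frac{1}{n}\Bigl(1 - \frac{(1 - \lambda_i w_i)^2}{1 + \lambda_i^2 q_i}\Bigr).$$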
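
Since equation (8) is an exact algebraic identity, it can be checked to machine
precision. The sketch below is a sanity check under illustrative assumptions
(the function name `leave_one_out_identity`, the sizes, and the law of the
$\lambda_i$'s are not from the original); it computes both sides of (8) from the
definitions of $S_i$, $q_i$, $w_i$ and $Z_i$ given above.

```python
import numpy as np

def leave_one_out_identity(Y, lam, i):
    """Return (Z_np, right-hand side of (8)) when observation i is left out."""
    n, p = Y.shape
    e = np.ones(n)
    W = lam[:, None] * Y                        # W = Lambda Y
    mu = W.T @ e / n
    Z = mu @ np.linalg.solve(W.T @ W / n, mu)   # Z_{n,p}

    lam_i = lam.copy()
    lam_i[i] = 0.0                              # Lambda_i: i-th entry set to 0
    W_i = lam_i[:, None] * Y
    S_i = W_i.T @ W_i / n
    mu_i = W_i.T @ e / n
    Si_inv_Yi = np.linalg.solve(S_i, Y[i])
    q_i = Y[i] @ Si_inv_Yi / n                  # q_i = Y_i' S_i^{-1} Y_i / n
    w_i = mu_i @ Si_inv_Yi                      # w_i = mu_i' S_i^{-1} Y_i
    Z_i = mu_i @ np.linalg.solve(S_i, mu_i)     # does not involve Y_i
    rhs = Z_i + (1.0 - (1.0 - lam[i] * w_i) ** 2
                 / (1.0 + lam[i] ** 2 * q_i)) / n
    return Z, rhs

rng = np.random.default_rng(0)
n, p = 50, 10
Y = rng.standard_normal((n, p))
lam = 0.5 + rng.exponential(size=n)
Z, rhs = leave_one_out_identity(Y, lam, i=3)
print(Z, rhs, abs(Z - rhs))
```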
We now recall the Efron-Stein inequality, as formulated in Theorem 9 of
Lugosi (2006): if $\alpha = f(X_1, \ldots, X_n)$, where the $X_i$'s are
independent, and $\alpha_i$ is a measurable function of
$(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n)$, then

$$\mathrm{var}(\alpha) \leq \sum_{i=1}^n E((\alpha - \alpha_i)^2).$$

In particular, for us, it means that

$$\mathrm{var}(Z_{n,p} \mid \Lambda) \leq \sum_{i=1}^n E\Bigl(\Bigl(Z_{n,p} - Z_i - \frac{1}{n}\Bigr)^2 \Bigm| \Lambda\Bigr).$$

If we now use equation (8) and the fact that $q_i \geq 0$, we have

$$n\,\Bigl|Z_{n,p} - Z_i - \frac{1}{n}\Bigr| = \frac{(1 - \lambda_i w_i)^2}{1 + \lambda_i^2 q_i} \leq 2(1 + \lambda_i^2 w_i^2).$$

Moreover, conditional on $Y_{(-i)} = (Y_1, \ldots, Y_{i-1}, Y_{i+1}, \ldots, Y_n)$
(and $\Lambda$, since all our arguments at this point are made conditional on
$\Lambda$), $w_i$ is $\mathcal{N}(0, \hat\mu_i' S_i^{-2} \hat\mu_i)$ when the
$Y_i$'s are $\mathcal{N}(0, \mathrm{Id}_p)$, because
$w_i = \hat\mu_i' S_i^{-1} Y_i$. Therefore,

$$E(w_i^4 \mid \Lambda) = 3\,E((\hat\mu_i' S_i^{-2} \hat\mu_i)^2 \mid \Lambda).$$

Almost by definition, we have $\hat\mu_i' S_i^{-1} \hat\mu_i \leq 1$, since the
vector $e/\sqrt{n}$ has norm 1 and $W_i (W_i'W_i)^{-1} W_i'$ is a projection
matrix (recall that $S_i = W_i'W_i/n$ and $\hat\mu_i = W_i'e/n$). So we would be
done if we had uniform control on $|||S_i^{-1}|||_2$. Let us now go around this
difficulty.
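• Regularization interlude.

Let us consider, for $t > 0$, $Z(t) = \hat\mu'(S + t\,\mathrm{Id}_p)^{-1}\hat\mu$,
where $\hat\mu = W'e/n$. Clearly, $0 \leq Z(t) \leq Z_{n,p} = Z(0)$, because
$S + t\,\mathrm{Id}_p \succeq S \succeq 0$ in the positive-semidefinite
ordering. In other respects, the decomposition in equation (8) is still valid if
we replace $Z_i$ by $Z_i(t)$ and $S_i$ by $S_i(t)$ everywhere. However,
$|||(S_i(t))^{-1}|||_2 \leq 1/t$. We therefore have

$$\hat\mu_i'(S_i(t))^{-2}\hat\mu_i \leq |||(S_i(t))^{-1}|||_2\; \hat\mu_i'(S_i(t))^{-1}\hat\mu_i \leq \frac{1}{t}\, \hat\mu_i' S_i^{-1} \hat\mu_i \leq \frac{1}{t}.$$

So applying the previous analysis and using the fact that
$\hat\mu_i'(S_i(t))^{-2}\hat\mu_i \leq 1/t$, we conclude that

$$\mathrm{var}(Z(t) \mid \Lambda) \leq \frac{8}{n^2} \sum_{i=1}^n \Bigl(1 + 3\,\frac{\lambda_i^4}{t^2}\Bigr).$$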

So under our assumptions, $Z(t)$ can be approximated, in probability, at
least conditionally on $\Lambda$, by $E(Z(t) \mid \Lambda)$. If we write the
singular value decomposition of $W/\sqrt{n} = \sum_{i=1}^p \sigma_i u_i v_i'$,
where $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_p$, we have
$W S^{-1} W'/n = \sum_{i=1}^p u_i u_i'$,
$W (S(t))^{-1} W'/n = \sum_{i=1}^p \sigma_i^2/(\sigma_i^2 + t)\, u_i u_i'$, and
therefore

$$0 \leq Z_{n,p} - Z(t) = \frac{1}{n} \sum_{i=1}^p \frac{t}{\sigma_i^2 + t}\, (u_i'e)^2 \leq \frac{t}{\sigma_p^2 + t}\, \frac{1}{n} \sum_{i=1}^p (u_i'e)^2 \leq \frac{t}{\sigma_p^2 + t}.$$

To get the inequality above, we used the fact that the $\{u_i\}_{i=1}^p$ are
orthonormal in $\mathbb{R}^n$, and can therefore be completed to form an
orthonormal basis of this vector space. The quantities $u_i'e$ are naturally the
coefficients of $e$ in this basis, and we know that their sum of squares should
be the squared norm of $e$, which is $n$.
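
The regularization bound is deterministic given $W$, so it lends itself to a
direct numerical check. A minimal sketch, assuming an illustrative law for the
$\lambda_i$'s and hypothetical names (`regularization_gap`), compares
$Z_{n,p} - Z(t)$ with $t/(\sigma_p^2 + t)$ for a few values of $t$:

```python
import numpy as np

def regularization_gap(W, t):
    """Return (Z(0) - Z(t), t / (sigma_p^2 + t)) for W = Lambda Y."""
    n, p = W.shape
    mu = W.T @ np.ones(n) / n                    # mu_hat = W'e / n
    S = W.T @ W / n
    Z0 = mu @ np.linalg.solve(S, mu)             # Z_{n,p} = Z(0)
    Zt = mu @ np.linalg.solve(S + t * np.eye(p), mu)
    # smallest singular value of W / sqrt(n)
    sigma_p = np.linalg.svd(W / np.sqrt(n), compute_uv=False).min()
    return Z0 - Zt, t / (sigma_p ** 2 + t)

rng = np.random.default_rng(0)
n, p = 200, 50
W = (0.5 + rng.exponential(size=n))[:, None] * rng.standard_normal((n, p))
for t in (0.01, 0.1, 1.0):
    gap, bound = regularization_gap(W, t)
    print(f"t = {t:5.2f}: gap = {gap:.5f}, bound = {bound:.5f}")
```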

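Let us now call $\mathcal{L}_{\epsilon_0,\delta_0}$ the set of matrices
$\Lambda$ such that $p/N < 1 - \epsilon_0$ and $C_0(N-1)/(n-1) > \delta_0$.
Under our assumptions, for a $\delta_0$ bounded away from 0 (e.g.,
$\delta_0 = C_0\eta_0/2$), $P(\Lambda \in \mathcal{L}_{\epsilon_0,\delta_0}) \to 1$.
Let us pick such a $\delta_0$. If $\Lambda \in \mathcal{L}_{\epsilon_0,\delta_0}$,
according to Lemma B.1 and the proof of Theorem 4.1, if $P_\Lambda$ denotes
probability conditional on $\Lambda$,

$$P_\Lambda\bigl(\sigma_p \leq \sqrt{\delta_0}\,[(1 - \sqrt{1 - \epsilon_0}) - t]\bigr) \leq \exp(-(n-1)\delta_0 t^2/C_0).$$

Hence, when $\Lambda \in \mathcal{L}_{\epsilon_0,\delta_0}$, we can find, for
any $u > 0$, an $\eta(u) > 0$ such that

$$P(|Z_{n,p} - Z(\eta(u))| > u) \leq f_n(\epsilon_0, \delta_0, \eta(u), u) = f_n(u),$$

where $f_n(u) = f_n(\epsilon_0, \delta_0, \eta(u), u) \to 0$ as $n \to \infty$,
for fixed $u$.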

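On the other hand, our conditional variance computations have established
that, for any $\eta > 0$, $Z(\eta) - E(Z(\eta) \mid \Lambda)$ converges in
probability (conditional on $\Lambda$) to 0 if
$\eta^{-2} \sum_{i=1}^n \lambda_i^4/n^2$ tends to 0. We note that
$0 \leq Z_{n,p} \leq 1$ and that the same is true for
$\gamma_n(u) = E(Z(\eta(u)) \mid \Lambda)$. Therefore,
$|Z_{n,p} - \gamma_n(u)| \leq 1$ and $E((Z_{n,p} - \gamma_n(u))^2 \mid \Lambda)$
goes to zero, since

$$\begin{aligned}
E((Z_{n,p} - \gamma_n(u))^2 \mid \Lambda)
&\leq u^2\, P(|Z_{n,p} - \gamma_n(u)| \leq u \mid \Lambda) + P(|Z_{n,p} - \gamma_n(u)| > u \mid \Lambda) \\
&\leq u^2 + P(|Z_{n,p} - Z(\eta(u))| > u/2 \mid \Lambda) + \frac{4}{u^2}\, \mathrm{var}(Z(\eta(u)) \mid \Lambda).
\end{aligned}$$

In other words, we also have, if $\Lambda \in \mathcal{L}_{\epsilon_0,\delta_0}$,
for any $u > 0$,

$$\mathrm{var}(Z_{n,p} \mid \Lambda) \leq u^2 + f_n(u/2) + \frac{32}{u^2 n^2} \sum_{i=1}^n \Bigl(1 + 3\,\frac{\lambda_i^4}{\eta(u)^2}\Bigr).$$

Hence, if $\Lambda \in \mathcal{L}_{\epsilon_0,\delta_0}$ and
$\sum_{i=1}^n \lambda_i^4/n^2 \to 0$, $\mathrm{var}(Z_{n,p} \mid \Lambda)$ goes
to zero as $n$ goes to infinity, and we conclude that, since
$E(Z_{n,p} \mid \Lambda) = p/n$,

$$Z_{n,p} - \frac{p}{n} \to 0 \quad \text{in probability, conditional on } \Lambda.$$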
• Deconditioning on $\Lambda$.

Let us call $\mathcal{L}^2_{\epsilon_0,\delta_0,t}$ the set of matrices such
that $\mathcal{L}^2_{\epsilon_0,\delta_0,t} = \mathcal{L}_{\epsilon_0,\delta_0}
\cap \{\frac{1}{n^2}\sum_{i=1}^n \lambda_i^4 \leq t\}$. Our previous
computations clearly show that we can find a function $g_n(u)$, with
$g_n(u) \to 0$ as $n \to \infty$, such that, for any $u > 0$, when
$\Lambda \in \mathcal{L}^2_{\epsilon_0,\delta_0,u^4\eta(u)^2} \triangleq
\mathcal{L}^2(u)$, $\mathrm{var}(Z_{n,p} \mid \Lambda) \leq 97u^2 + g_n(u)$, and
hence we have the "uniform bound," if $\Lambda \in \mathcal{L}^2(u)$,

$$P\Bigl(\Bigl|Z_{n,p} - \frac{p}{n}\Bigr| > x \Bigm| \Lambda\Bigr) \leq \frac{97u^2 + g_n(u)}{x^2}.$$

Now under our assumptions, $P(\Lambda \in \mathcal{L}^2(u))$ goes to 1 for any
given $u$, so we conclude, using the fact that

$$P(|Z_{n,p} - p/n| > x) \leq P[\,|Z_{n,p} - p/n| > x \ \&\ \Lambda \in \mathcal{L}^2(u)\,] + P[\Lambda \notin \mathcal{L}^2(u)],$$

that

$$Z_{n,p} - \frac{p}{n} \to 0 \quad \text{in probability.}$$

This last statement is now understood of course unconditionally on $\Lambda$
and this proves the first part of the theorem.
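• Proof of the second part of the theorem.

We now focus on the $\hat m'\hat\Sigma^{-1}\hat m$ part of the theorem. Let us
call $\mathfrak{S} = X'X/n$. Then,
$\frac{n-1}{n}\hat\Sigma = \mathfrak{S} - \hat m \hat m'$. Therefore,

$$\frac{n}{n-1}\, \hat\Sigma^{-1} = \mathfrak{S}^{-1} + \frac{\mathfrak{S}^{-1} \hat m \hat m' \mathfrak{S}^{-1}}{1 - \hat m' \mathfrak{S}^{-1} \hat m}.$$

Hence,

$$\frac{n}{n-1}\, \hat m'\hat\Sigma^{-1}\hat m = \frac{\hat m' \mathfrak{S}^{-1} \hat m}{1 - \hat m' \mathfrak{S}^{-1} \hat m} = \frac{Z_{n,p}}{1 - Z_{n,p}}.$$

Since $Z_{n,p} \to \rho$ in probability with $\rho \in (0,1)$, we have the
result announced in the theorem. □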

Now that we have proved Theorem 4.3, we need to turn to results that will
allow us to handle the case of nonzero population mean, as well as questions
such as the convergence of $\hat\mu'\hat\Sigma^{-1}v$, for deterministic $v$.

4.2.1. On quantities of the type $\hat\mu'\hat\Sigma^{-1}v$. Recall that the
key quantity in the solution of problem (QP-eqc-Emp), the problem of main
interest in this paper, is of the form $V'\hat\Sigma^{-1}V$. Therefore, it is
important for us to understand quantities of the type

$$\zeta = \hat\mu'\hat\Sigma^{-1}v,$$

for a fixed vector $v$. At this point, we focus on the particular case where
$\mu = E(X_i) = 0$. To do so, we will need to study, if
$S = Y'\Lambda'\Lambda Y/n$,

$$\zeta = \frac{1}{n}\, e'\Lambda Y S^{-1} v,$$

for a fixed vector $v$. As it turns out, this random variable goes to zero in
probability when for instance $\|v\|_2 = 1$.

Theorem 4.5. Suppose $v$ is a deterministic vector, with $\|v\|_2 = 1$.
Suppose the assumptions stated in Theorem 4.3 hold and also that

(Assumption-BLb)
$$\frac{1}{n}\sum_{i=1}^n \lambda_i^2 \quad \text{remains bounded with probability going to 1.}$$

Consider

$$\zeta = \frac{1}{n}\, e'\Lambda Y S^{-1} v,$$

where $S = \frac{1}{n} Y'\Lambda^2 Y$. Then

$$\zeta \to 0 \quad \text{in probability.}$$

Before giving the proof, we note that if the $\lambda_i$'s are i.i.d. and have a
second moment, the "extra" condition on $\sum_{i=1}^n \lambda_i^2/n$ introduced
in this theorem (as compared to Theorem 4.3) is clearly satisfied by the law of
large numbers.
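
As a rough numerical illustration of Theorem 4.5, and not a substitute for its
proof, the sketch below draws $\zeta$ for increasing $n$ with $p/n = 1/4$. The
choice $v = e_1$, the law of the $\lambda_i$'s, the sizes and the function name
`zeta` are assumptions made for the example.

```python
import numpy as np

def zeta(n, p, rng):
    """One draw of zeta = e' Lambda Y S^{-1} v / n with v the first basis vector."""
    lam = 0.5 + rng.exponential(size=n)     # illustrative lambda_i's
    W = lam[:, None] * rng.standard_normal((n, p))
    mu = W.T @ np.ones(n) / n               # W'e / n
    S = W.T @ W / n                         # S = Y' Lambda^2 Y / n
    v = np.zeros(p)
    v[0] = 1.0                              # deterministic, ||v||_2 = 1
    return mu @ np.linalg.solve(S, v)

rng = np.random.default_rng(0)
for n in (100, 400, 1600):
    draws = np.array([zeta(n, n // 4, rng) for _ in range(20)])
    print(f"n = {n:5d}: mean |zeta| = {np.abs(draws).mean():.4f}")
```

Under these assumptions the average of $|\zeta|$ should shrink as $n$ grows.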

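Proof of Theorem 4.5. The proof is quite similar to the proof of
Theorem 4.3 above. We start by conditioning on $\Lambda$.

Let us call $\zeta(t)$ the quantity obtained when we replace $S$ by
$S(t) = S + t\,\mathrm{Id}_p$ in the definition of $\zeta$. Note that since $Y$
is symmetric, $\zeta(t) \overset{\mathcal{L}}{=} -\zeta(t)$, conditionally on
$\Lambda$, by arguments similar to those given in the proof of Lemma 4.4. Now
clearly $\zeta(t)$ has an expectation (conditional on $\Lambda$), because
$|||S^{-1}(t)|||_2 \leq 1/t$, for $t > 0$, so $E(\zeta(t) \mid \Lambda) = 0$.
Now recall equation (6): with the notations used there,

$$\Lambda Y S^{-1} = W_i S_i^{-1} - \frac{\lambda_i^2}{n}\, \frac{r_i Y_i' S_i^{-1}}{1 + \lambda_i^2 q_i} + \lambda_i e_i Y_i' S_i^{-1} - \lambda_i^3\, \frac{q_i\, e_i Y_i' S_i^{-1}}{1 + \lambda_i^2 q_i}.$$

Let us now call $q_i(t) = Y_i' S_i(t)^{-1} Y_i/n$,
$w_i(t) = e'W_i S_i(t)^{-1} Y_i/n = \hat\mu_i' S_i(t)^{-1} Y_i$ and
$\theta_i(t) = Y_i' S_i(t)^{-1} v$. Clearly, if $\zeta_i(t)$ is the random
variable obtained by excluding $Y_i$ from the computation of $\zeta(t)$ (e.g.,
by replacing $\lambda_i$ by 0), we have

$$\zeta(t) = \zeta_i(t) - \frac{\lambda_i^2}{n}\, \frac{w_i(t)\theta_i(t)}{1 + \lambda_i^2 q_i(t)} + \frac{\lambda_i}{n}\, \frac{\theta_i(t)}{1 + \lambda_i^2 q_i(t)}.$$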
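We remark that $\theta_i(t) \mid (Y_{(-i)}, \Lambda) \sim \mathcal{N}(0, v'S_i^{-2}(t)v)$
and recall that $w_i(t) \mid (Y_{(-i)}, \Lambda) \sim
\mathcal{N}(0, \hat\mu_i' S_i^{-2}(t) \hat\mu_i)$. Using the fact that
$\|v\|_2 = 1$, $|||S_i^{-2}(t)|||_2 \leq t^{-2}$ and the remarks we made in the
proof of Theorem 4.3, we get that
$E([\theta_i(t)]^{2k} \mid (Y_{(-i)}, \Lambda)) \leq C_k t^{-2k}$,
$E([w_i(t)]^{2k} \mid (Y_{(-i)}, \Lambda)) \leq C_k t^{-k}$, where $C_1 = 1$ and
$C_2 = 3$. We also have

$$[\lambda_i\theta_i(t) - \lambda_i^2 w_i(t)\theta_i(t)]^2 \leq 2[\lambda_i^2\theta_i^2(t) + \lambda_i^4\theta_i^2(t)w_i^2(t)].$$

Hence, simply using the fact that $2(ab)^2 \leq (a^4 + b^4)$, we get

$$E\Bigl(\Bigl[\frac{\lambda_i\theta_i(t)(1 - \lambda_i w_i(t))}{1 + \lambda_i^2 q_i(t)}\Bigr]^2 \Bigm| \Lambda\Bigr) \leq \frac{2\lambda_i^2}{t^2} + 3\lambda_i^4\Bigl(\frac{1}{t^4} + \frac{1}{t^2}\Bigr).$$

We conclude by the Efron-Stein inequality that, when $\Lambda$ is such that
$\sum_{i=1}^n \lambda_i^4/n^2 \to 0$, for any $t > 0$,

$$\zeta(t) \to 0 \quad \text{in probability, conditionally on } \Lambda.$$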

As before, let us call $\mathcal{L}_{\epsilon_0,\delta_0}$ the set of matrices
$\Lambda$ such that $p/N < 1 - \epsilon_0$ and $C_0(N-1)/(n-1) > \delta_0$.
Recall that under our assumptions, for $\delta_0$ bounded away from 0 (e.g.,
$\delta_0 = C_0\eta_0/2$), $P(\Lambda \in \mathcal{L}_{\epsilon_0,\delta_0}) \to 1$.

As we saw before, when $\Lambda \in \mathcal{L}_{\epsilon_0,\delta_0}$,
$|||S^{-1}|||_2$ is bounded with high probability (conditional on $\Lambda$), so
we conclude that, for any $\eta > 0$, we can find a $t$ such that

$$|||S^{-1} - S^{-1}(t)|||_2 \leq \eta \quad \text{with probability (conditional on } \Lambda \text{) going to 1.}$$

We also notice that conditionally on $\Lambda$,
$\hat\mu \sim \mathcal{N}(0, \frac{\sum_i \lambda_i^2}{n^2}\,\mathrm{Id}_p)$ and
hence, $\|\hat\mu\|_2^2 \sim \frac{\chi_p^2}{n}\,(\sum_i \lambda_i^2)/n$. We
recall that $\|v\|_2 = 1$, and since

$$|\zeta - \zeta(t)| \leq \|\hat\mu\|_2\, |||S^{-1} - S^{-1}(t)|||_2\, \|v\|_2,$$

we conclude that with high probability (conditional on $\Lambda$), for any
$\eta > 0$, $|\zeta - \zeta(t)| \leq \eta$ and finally,

$$\zeta \to 0 \quad \text{in probability, conditionally on } \Lambda.$$

Now along the same lines as what was done in the proof of Theorem 4.3, we
can make all these probability bounds uniform in $\Lambda$ when $\Lambda$ is in
a set of matrices such as $\mathcal{L}_{\epsilon_0,\delta_0}$ and when we also
have bounds on $\sum_{i=1}^n \lambda_i^4/n^2$ and $\sum_{i=1}^n \lambda_i^2/n$.
Under our assumptions, the set of $\Lambda$ for which these conditions hold has
measure going to 1, so we can finally conclude, along the same lines (omitted
here) as in the proof of Theorem 4.3, that, unconditionally on $\Lambda$,

$$\zeta \to 0 \quad \text{in probability.} \quad \square$$

After these preliminaries, we can finally state the theorem of main interest.
Recall that under the assumptions of Theorem 4.1, if $v$ is deterministic,

$$\frac{v'\hat\Sigma^{-1}v}{v'\Sigma^{-1}v} \to s \quad \text{in probability,}$$

where $s$ is defined in equation (4).
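Theorem 4.6. Suppose that $X_i = \mu + \lambda_i \Sigma^{1/2} Y_i$, where the
$Y_i$ are i.i.d. $\mathcal{N}(0, \mathrm{Id}_p)$ and $\{\lambda_i\}_{i=1}^n$ are
random variables, independent of $\{Y_i\}_{i=1}^n$. Let $v$ be a deterministic
vector. Suppose that $\rho_n = p/n$ has a finite nonzero limit, $\rho$, and that
$\rho \in (0,1)$.

We call $\tau_i = \lambda_i^2$. We assume that $\tau_i \neq 0$ for all $i$ as
well as

(Assumption-BL)
$$\frac{1}{n^2}\sum_{i=1}^n \tau_i^2 \to 0 \ \text{in probability and} \quad \frac{1}{n}\sum_{i=1}^n \tau_i \ \text{remains bounded in probability.}$$

If $\tau_{(i)}$ is the $i$th largest $\tau_k$, we assume that we can find a
random variable $N \in \mathbb{N}$ and positive real numbers $\epsilon_0$ and
$C_0$ such that

(Assumption-BB)
$$\begin{cases}
P(p/N < 1 - \epsilon_0) \to 1 & \text{as } n \to \infty,\\
P(\tau_{(N)} > C_0) \to 1, &\\
\exists\, \eta_0 > 0 \text{ such that } P(N/n > \eta_0) \to 1 & \text{as } n \to \infty.
\end{cases}$$

We also assume that the empirical distribution of the $\tau_i$'s converges
weakly in probability to a deterministic limit $G$.

We call $\Lambda$ the $n \times n$ diagonal matrix with
$\Lambda(i,i) = \lambda_i$, $Y$ the $n \times p$ matrix whose $i$th row is
$Y_i$, $W = \Lambda Y$ and $S = W'W/n = \sum_{k=1}^n \lambda_k^2 Y_k Y_k'/n$.
Finally, we use the notation $\hat\omega = W'e/n$, $\tilde\mu = \Sigma^{-1/2}\mu$.

Then, we have, for $s$ defined as in equation (4),

(9)
$$\frac{\hat\mu'\hat\Sigma^{-1}v}{\sqrt{v'\Sigma^{-1}v}} = \frac{\mu'\hat\Sigma^{-1}v}{\sqrt{v'\Sigma^{-1}v}} + o_P(1) = s\, \frac{\mu'\Sigma^{-1}v}{\sqrt{v'\Sigma^{-1}v}} + o_P\Bigl(\frac{\mu'\Sigma^{-1}v}{\sqrt{v'\Sigma^{-1}v}} \vee 1\Bigr),$$

the second statement holding if, for instance, $\mu$ and $v$ are such that the
first set of conditions in Lemma A.2 are met.
Also,

(10)
$$\hat\mu'\hat\Sigma^{-1}\hat\mu = \mu'\hat\Sigma^{-1}\mu + \frac{\rho_n}{1 - \rho_n} + 2\, \frac{n-1}{n}\, \frac{\hat\omega' S^{-1} \tilde\mu}{1 - \hat\omega' S^{-1} \hat\omega} + o_P(1),$$

and we recall that $\hat\omega' S^{-1} \tilde\mu/\|\tilde\mu\|_2 = o_P(1)$ and
$\hat\omega' S^{-1} \hat\omega = p/n + o_P(1)$.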

To be able to exploit equation (10) in practice, we make the following
remarks. We can consider three cases, having to do with the size of
$\mu'\Sigma^{-1}\mu$:

1. If $\mu'\Sigma^{-1}\mu \to 0$, then
$\hat\mu'\hat\Sigma^{-1}\hat\mu = \frac{\rho_n}{1 - \rho_n} + o_P(1)$.

2. If $\mu'\Sigma^{-1}\mu \to \infty$, then
$\hat\mu'\hat\Sigma^{-1}\hat\mu \sim s\,\mu'\Sigma^{-1}\mu$.

3. Finally, if $\mu'\Sigma^{-1}\mu$ stays bounded away from 0 and infinity,

$$\hat\mu'\hat\Sigma^{-1}\hat\mu = s\,\mu'\Sigma^{-1}\mu + \frac{\rho_n}{1 - \rho_n} + o_P(1).$$

A noticeable feature of these results is that the "extra bias"
$\kappa_n = \rho_n/(1 - \rho_n)$, which comes essentially from mis-estimation of
$\mu$, is constant within the class of elliptical distributions considered
here. This should be contrasted with the "scaling," $s$, which strongly depends
on the empirical distribution of the $\lambda_i$'s.

We now give a brief proof of Theorem 4.6. 

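Proof of Theorem 4.6. We first note that $\Sigma^{1/2}\hat\omega = \hat m$ in
the notation of Theorem 4.3. Also,
$\hat\mu = \mu + \Sigma^{1/2}\hat\omega = \mu + \hat m$. Finally,

$$\frac{n-1}{n}\, \hat\Sigma = \Sigma^{1/2} S \Sigma^{1/2} - \hat m \hat m' = \Sigma^{1/2}(S - \hat\omega\hat\omega')\Sigma^{1/2}.$$

Proof of equation (10). By writing $\hat\mu = \mu + \hat m$, we clearly have

$$\hat\mu'\hat\Sigma^{-1}\hat\mu = \mu'\hat\Sigma^{-1}\mu + 2\hat m'\hat\Sigma^{-1}\mu + \hat m'\hat\Sigma^{-1}\hat m.$$

We have already seen in Theorem 4.3 that the third term tends to
$\kappa = \rho/(1 - \rho)$. On the other hand, half of the middle term is equal
to

$$\frac{n-1}{n}\, \hat\omega'(S - \hat\omega\hat\omega')^{-1}\tilde\mu.$$

Since $(S - \hat\omega\hat\omega')^{-1} = S^{-1} + S^{-1}\hat\omega\hat\omega' S^{-1}/(1 - \hat\omega' S^{-1}\hat\omega)$,
we have

$$\hat m'\hat\Sigma^{-1} = \frac{n-1}{n}\, \frac{1}{1 - \hat\omega' S^{-1}\hat\omega}\, \hat\omega' S^{-1}\Sigma^{-1/2},$$

and we deduce the result of equation (10). We now remark that
$\hat\omega' S^{-1}\hat\omega$ is equal to the quantity $Z_{n,p}$ in Theorem 4.3.
The fact that $\hat\omega' S^{-1}\tilde\mu/\|\tilde\mu\|_2 = o_P(1)$ follows
from applying Theorem 4.5 with $v = \tilde\mu/\|\tilde\mu\|_2$.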

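Proof of equation (9). The proof of this result follows from a decomposition
similar to the one we just made. Clearly the only question is whether
$\hat m'\hat\Sigma^{-1}v/\sqrt{v'\Sigma^{-1}v}$ goes to 0. As we just saw,

$$\frac{\hat m'\hat\Sigma^{-1}v}{\sqrt{v'\Sigma^{-1}v}} = \frac{n-1}{n}\, \frac{1}{1 - \hat\omega' S^{-1}\hat\omega}\, \frac{\hat\omega' S^{-1}\Sigma^{-1/2}v}{\sqrt{v'\Sigma^{-1}v}}.$$

The results of Theorem 4.5 guarantee that

$$\frac{\hat\omega' S^{-1}\Sigma^{-1/2}v}{\|\Sigma^{-1/2}v\|_2} \to 0 \quad \text{in probability.}$$

Since $\hat\omega' S^{-1}\hat\omega$ tends to $\rho < 1$ and
$\|\Sigma^{-1/2}v\|_2^2 = v'\Sigma^{-1}v$, we have shown the result stated in
equation (9). □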

4.3. On the effect of correlation between observations. It is clear that 
in financial practice and other applied settings, the assumption that the 
returns (or observed data vectors) are independent is often questionable. 
So for quadratic programs with linear equality constraints (including the 
Markowitz problem but also going beyond it), it is natural to ask what is 
the impact of correlation in our observations on the empirical solution of 
the problem. In our notation, this means that the vectors $X_i$ and $X_j$ are
correlated; we refer to this situation as the correlated case or as the case of
temporal correlation.

Our work on the elliptical case comes in handy here and allows us to also
draw conclusions concerning the correlated case. We consider a particular
model, namely we assume that the $n \times p$ data matrix $X$ is given by

$$X = e_n\mu' + \Lambda Y \Sigma^{1/2},$$

where $\Lambda$ is a deterministic but not necessarily diagonal matrix, and $Y$
is a matrix with i.i.d. $\mathcal{N}(0,1)$ entries. We assume throughout that
$\Lambda$ is full rank. The model we consider now is more general than the one
we looked at before, since if $\Lambda = \mathrm{Id}_n$, we get the i.i.d.
Gaussian case, and if $\Lambda$ is diagonal we are back in an "elliptical" case
(where the ellipticity parameters are assumed to be deterministic, which amounts
to doing computations conditional on $\Lambda$). But when $\Lambda$ is not
diagonal, $X_i$ and $X_j$ might be correlated. [In all the situations where
$\Lambda$ is deterministic, the marginal distribution of $X_i$ is
$\mathcal{N}(\mu, s_i^2\Sigma)$, where $s_i$ is the norm of the $i$th row of
$\Lambda$.]

Because we want to focus here on robustness questions arising when going
from independent Gaussian random variables to correlated ones, we will assume
throughout that $\Lambda$ is deterministic. (Allowing $\Lambda$ to be random
simply requires some minor technical modifications but would make the
exposition a bit less clear.) Our main results in this subsection can be
interpreted as saying that the Gaussian analysis of Section 3, carried out in
the setting of independent observations, is not robust against these
independence assumptions. The results change quite significantly when the
vectors of observations are correlated.

In general, we write the singular value decomposition of the $n \times n$
matrix $\Lambda$ as $\Lambda = ADB'$ [see Horn and Johnson (1990), page 414],
where $A$ and $B$ are orthogonal, and $D$ is diagonal. Therefore,
$AA' = \mathrm{Id}_n$, and

$$\frac{1}{n}(X - e_n\mu')'(X - e_n\mu') = \frac{1}{n}\Sigma^{1/2} Y'BD^2B'Y \Sigma^{1/2} \overset{\mathcal{L}}{=} \frac{1}{n}\Sigma^{1/2} Y'D^2Y \Sigma^{1/2}.$$

So we are almost back in the elliptical case. The key difference now is that
what will matter in our analysis are not the diagonal entries of
$\Lambda'\Lambda$, but rather its eigenvalues (see Proposition 4.7). Also, we
will see (in Proposition 4.8) that the results change quite significantly when
we look at quantities like $\hat\mu'\hat\Sigma^{-1}\hat\mu$.

4.3.1. On quadratic forms involving $\hat\Sigma^{-1}$. As a counterpart to
Theorem 4.1, we have the following proposition.

Proposition 4.7. Suppose the $n \times p$ data matrix $X$ (whose $i$th row is
the $i$th vector of observations) can be written as

$$X = e_n\mu' + \Lambda Y \Sigma^{1/2},$$

where $\Lambda$ is a deterministic but not necessarily diagonal matrix. Suppose
that the eigenvalues of $\Lambda'\Lambda$ satisfy (Assumption-BB) with a
deterministic $N$ and that the spectral distribution of $\Lambda'\Lambda$
converges weakly to a probability distribution $G$. Suppose also that
$p/n \to \rho \in (0,1)$. Call $\hat\Sigma$ the classical sample covariance
matrix, that is,

$$\hat\Sigma = \frac{1}{n}(X - \bar X)'(X - \bar X).$$

Then, if $v$ is a deterministic vector, we have

$$\frac{v'\hat\Sigma^{-1}v}{v'\Sigma^{-1}v} \to s \quad \text{in probability,}$$

where $s$ satisfies, if $G$ is the limiting spectral distribution of
$\Lambda'\Lambda$,

$$\int \frac{dG(\tau)}{1 + \rho\tau s} = 1 - \rho.$$

The proposition shows that Theorem 4.1 essentially applies again; however,
now what matters, unsurprisingly, are the singular values of $\Lambda$ and
not its diagonal entries. The proof of Proposition 4.7, or rather the
adjustments needed to make the proof of Theorem 4.1 go through, are given in
the Appendix, Section C.1.

4.3.2. On quadratic forms involving $\hat\mu$ and $\hat\Sigma^{-1}$. This is
the situation where the results are most different from that of the
uncorrelated case. Once again, here we will be content to just state the
results; a detailed justification of our claims is in the Appendix, Section C.2.

As before, the most complicated aspect of the problem is to understand
quantities of the type $\hat\mu'\hat\Sigma^{-1}\hat\mu$, in the situation where
$\mu = 0$. In this setting, we have the following result.

Proposition 4.8. Suppose the $n \times p$ data matrix $X$ is such that, for $Y$
an $n \times p$ matrix with i.i.d. $\mathcal{N}(0,1)$ entries, and $\Lambda$ a
deterministic matrix,

$$X = \Lambda Y \Sigma^{1/2}.$$

We assume that (Assumption-BB) holds for the eigenvalues of $\Lambda'\Lambda$,
for a deterministic sequence $N(n)$. We write the singular value decomposition
of $\Lambda$ as $\Lambda = ADB'$.

We call $\mathfrak{S} = X'X/n$ and $\hat m = \Sigma^{1/2}Y'\Lambda'e/n$, that
is, the sample mean of the columns of $X$. We denote by $d_i$ the diagonal
elements of $D$, and $\tilde Y = B'Y \overset{\mathcal{L}}{=} Y$. We also call

$$F_i = \frac{1}{n}\sum_{k \neq i} d_k^2\, \tilde Y_k \tilde Y_k'.$$

If we call $\omega = A'e$, and $q_i = \tilde Y_i' F_i^{-1} \tilde Y_i/n$, we
have, if $\|\omega\|_4^4/n^2$ and $\|d\|_4^4/n^2 \to 0$,

$$\hat m'\mathfrak{S}^{-1}\hat m - \kappa(n,p) \to 0 \quad \text{in probability,}$$

where

$$\kappa(n,p) = \frac{1}{n}\sum_{i=1}^n \omega_i^2\, E(P(i,i)) \quad \text{and} \quad P(i,i) = 1 - \frac{1}{1 + d_i^2 q_i}.$$

Further,

$$\hat m'\hat\Sigma^{-1}\hat m - \frac{\kappa(n,p)}{1 - \kappa(n,p)} \to 0 \quad \text{in probability.}$$

Furthermore, under the above assumptions, if the spectral distribution of
$\Lambda'\Lambda$ converges to $G$ and $(\sum_{i=1}^n d_i^2)/n$ remains bounded,
a result similar to Theorem 4.6 holds, with $s$ being computed by solving
equation (4) with the corresponding $G$ and $\kappa(n,p)$ playing the role of
$\rho_n/(1 - \rho_n)$.

Essentially the previous proposition tells us that when dealing with
correlated variables, the new $\kappa(n,p)$ replaces the old
$\kappa = \rho/(1 - \rho)$. We note that there are no inconsistencies with our
previous results as $\sum_i P(i,i) = \mathrm{trace}(P) = p$ and in the
"elliptical" case (i.e., $\Lambda$ diagonal), $\omega_i^2 = 1$, so the previous
proposition is consistent with the results we have obtained in the elliptical
case. We also remark that $\|\omega\|_2 = \sqrt{n}$, since $A$ is orthogonal.

Finally, in the case where the $d_i$'s have a limiting spectral distribution
and satisfy (Assumption-BB), further computations show that
$q_i - \rho_n s \to 0$. However, this does not help (in general) in getting a
simpler expression for $\kappa(n,p)$.

4.4. On the bootstrap. An interesting aspect of the analysis of elliptical
models is that it also sheds light on the properties of the bootstrap in this
context. As a matter of fact, the nonparametric bootstrap yields covariance
matrices that have a structure similar to those computed from elliptical
distributions: if we call $D$ the diagonal matrix whose $i$th diagonal entry is
the number of times observation $X_i$ appears in our bootstrap sample, we
have, if $\hat\Sigma^*$ is the bootstrapped covariance matrix,

$$\hat\Sigma^* = \frac{1}{n-1}\, X'DX - \frac{n}{n-1}\, \hat\mu^*(\hat\mu^*)',$$

where $X$ is our original data matrix, and $\hat\mu^*$ is the sample mean of our
bootstrap sample, which can also be written $\hat\mu^* = X'De/n$. Unless
otherwise noted, we assume in the discussion that follows that the population
mean $\mu$ is 0. Since the covariance matrix is shift-invariant, we can make
this assumption without loss of generality. We call

$$\mathfrak{S}^* = \frac{1}{n}\, X'DX \quad \text{and} \quad S^* = \Sigma^{-1/2}\,\mathfrak{S}^*\,\Sigma^{-1/2}.$$

As we will see shortly, understanding the properties of $\hat\Sigma^*$ boils
down to understanding those of $\mathfrak{S}^*$ so we will focus on this
slightly more convenient object in this short discussion.
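
The weighted representation of the bootstrapped covariance matrix can be
checked against a direct computation on the resampled rows. The sketch below is
illustrative (the function name `bootstrap_covariance` and the sizes are
assumptions); it builds the diagonal of $D$ from the resampled indices and
compares the formula above with `numpy.cov` applied to the bootstrap sample.

```python
import numpy as np

def bootstrap_covariance(X, counts):
    """Sigma_hat^* = X'DX/(n-1) - n/(n-1) mu^* mu^*', with D = diag(counts)."""
    n = X.shape[0]
    mu_star = X.T @ counts / n                  # mu_hat^* = X'De / n
    XDX = (X * counts[:, None]).T @ X           # X'DX
    return XDX / (n - 1) - n / (n - 1) * np.outer(mu_star, mu_star)

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))
idx = rng.integers(0, n, size=n)                # rows drawn with replacement
counts = np.bincount(idx, minlength=n).astype(float)   # diagonal of D
direct = np.cov(X[idx], rowvar=False)           # covariance of the bootstrap sample
print(np.allclose(bootstrap_covariance(X, counts), direct))
```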

We note that if $X$ is Gaussian, $\mathfrak{S}^*$ can be thought of as a
"covariance matrix" computed from the elliptical data
$\tilde X_i = d_i^{1/2} X_i$. The same remark applies when $X$ is elliptical,
that is, for us, $X_i = \lambda_i\,\mathcal{N}(0, \Sigma)$: all we need to do is
change the "ellipticity parameter" $\lambda_i$ to $\sqrt{d_i}\,\lambda_i$. The
same remark is also applicable to the case of correlated observations, that is,
$X = \Lambda Y \Sigma^{1/2}$, where $\Lambda$ is not diagonal anymore. Studying
the bootstrap properties of such a model is the same as studying that of the
model where we replace $\Lambda$ by $\sqrt{D}\Lambda$. We therefore would like
to apply directly all the results we have obtained above in our study of
elliptical models to better understand the bootstrap. For quantities of the form
$v'(\hat\Sigma^*)^{-1}v$, we will see that we can essentially do it, but
differences will appear when dealing with
$(\hat\mu^*)'(\hat\Sigma^*)^{-1}\hat\mu^*$, which yields statistics that are not
exactly analogous to corresponding statistics appearing in the elliptical case.

Our focus will be on bias properties of bootstrapped replications, so we
will aim for convergence in probability results and not fluctuation behavior.
Our overall strategy here is to show convergence in probability of the
quantities we are interested in as functions of both the $d_i$'s and $X_i$'s. We
will derive the convergence properties of our bootstrapped statistics by then
conditioning on the data and arguing that with high probability (over the
$X_i$'s), this does not change the results much. We first give some needed
background on the bootstrap in Sections 4.4.1 and 4.4.2, then turn to properties
of quantities like $v'(\hat\Sigma^*)^{-1}v$ (in Section 4.4.3) and finally study
$(\hat\mu^*)'(\hat\Sigma^*)^{-1}\hat\mu^*$ (in Section 4.4.4), where we will see
(in Proposition 4.13) some key differences with the elliptical case. We conclude
this subsection with a brief discussion of the parametric bootstrap and the
conclusions that can be reached about it through our results.

4.4.1. A remark on needed convergence properties. Making statements 
about bootstrapped statistics requires us to make statements that are con- 
ditional on the observed data. This is not a trivial matter for the statistics we 
deal with since they cannot be easily described in terms of simple formulas 
involving the original observations. However, we can take a roundabout way: 
by showing joint convergence in probability (joint here refers to the "new" 
data being the vectors of bootstrapped weights and observations), we can 
obtain interesting conclusions conditional on the data. Though this is not 
difficult to show, we give full arguments here for the sake of completeness. 

We will look at our statistics as functions of the number of times an
observation appears in the sample and also, of course, of our observations.
In other words, the original statistic, $T_n$, can be written

$$T_n = f(1, \ldots, 1, X_1, \ldots, X_n),$$

and the bootstrapped version $T_n^*$ is, if observation $X_i$ appears $w_i^*$
times in the bootstrap sample,

$$T_n^* = f(w_1^*, \ldots, w_n^*, X_1, \ldots, X_n).$$

The following simple proposition is used repeatedly in our bootstrap work.

Proposition 4.9. Let us consider a statistic
$T_n = f(w_1, \ldots, w_n, X_1, \ldots, X_n)$, where $w_i$ is the number of
times $X_i$ appears in our sample. Suppose that the vector of weights, $w$, is
independent of the data matrix $X$. Denote by $\mathcal{Q}_n$ the joint
probability distribution of the $w_i$'s, $\mathcal{P}_n$ the joint probability
distribution of the $X_i$'s and $\mathcal{R}_n = \mathcal{Q}_n \times
\mathcal{P}_n$ the probability distribution of
$(w_1, \ldots, w_n, X_1, \ldots, X_n)$.

Suppose we have established that $T_n$ tends in $\mathcal{R}_n$-probability to
$c$, a deterministic object, as $n \to \infty$.

Then we have, with $\mathcal{P}_n$-probability going to 1 as $n \to \infty$,

$$T_n \mid \{X_i\}_{i=1}^n \to c \quad \text{in } \mathcal{Q}_n\text{-probability.}$$

In other words, calling $\mathcal{X}_n = \{X_i\}_{i=1}^n$, for all
$\epsilon, \eta > 0$, if $Q_n(\epsilon) = \mathcal{Q}_n(|T_n - c| > \epsilon
\mid \mathcal{X}_n)$, $\mathcal{P}_n(Q_n(\epsilon) > \eta) \to 0$ as $n$ tends
to infinity.

In the case where the weights $w_i$ are obtained by standard bootstrapping,
$\mathcal{Q}_n$ is multinomial$(1/n, \ldots, 1/n, n)$. Then,
$T_n \mid \mathcal{X}_n$ has the distribution of the usual bootstrap quantity
$T_n^*$. We will focus on this case more specifically later.

Proof of Proposition 4.9. The proof and the statement are almost
obvious but we include them for the sake of completeness. Let us call
$\tau_n = |T_n - c|$ and $\mathcal{X}_n = \{X_1, \ldots, X_n\}$. By assumption,
$\tau_n \to 0$ in $\mathcal{R}_n$-probability. Hence,

$$E_{\mathcal{R}_n}(1_{\tau_n > \epsilon}) = E_{\mathcal{P}_n}(E_{\mathcal{Q}_n}[1_{\tau_n > \epsilon} \mid \mathcal{X}_n]) \to 0.$$

Let us call $Q_n(\epsilon) = \mathcal{Q}_n(|T_n - c| > \epsilon \mid
\mathcal{X}_n)$. Clearly, $0 \leq Q_n(\epsilon) \leq 1$ and
$E_{\mathcal{P}_n}(Q_n(\epsilon)) \to 0$, so for any $\eta > 0$,

$$\mathcal{P}_n(Q_n(\epsilon) > \eta) \to 0. \quad \square$$

We now investigate the case of the classical bootstrap, that is, the
situation in which $\mathcal{Q}_n$ is multinomial$(\frac{1}{n}, \ldots,
\frac{1}{n}, n)$.

4.4.2. Empirical distribution of bootstrap weights. As we saw in Theorem 4.1,
the empirical distribution of the ellipticity parameters crucially affects
statistics of the type $v'\hat\Sigma^{-1}v$, so to understand the effect of
bootstrapping, we need to understand the empirical distribution of the bootstrap
weights. This question has surely been investigated, but we did not find a
good reference, so we provide the result and a simple proof for the convenience
of the reader.

Proposition 4.10. Let the vector $w$ be distributed according to a
multinomial$(\frac{1}{n}, \ldots, \frac{1}{n}, n)$ distribution. Call $F_n$ the
empirical distribution of the vector $w$. Then

$$F_n \Rightarrow \mathrm{Po}(1) \quad \text{in probability,}$$

where $\mathrm{Po}(1)$ is the Poisson distribution with parameter 1.

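Proof. Let us first start by an elementary remark: suppose
$\pi_1, \ldots, \pi_n$ are i.i.d. with distribution $\mathrm{Po}(1)$. Call
$\Pi_n = \sum_{i=1}^n \pi_i$. Then

$$(\pi_1, \ldots, \pi_n) \mid \{\Pi_n = n\} \sim \text{multinomial}\Bigl(\frac{1}{n}, \ldots, \frac{1}{n}, n\Bigr).$$

This result is a simple application of Bayes's rule and the fact that
$\Pi_n \sim \mathrm{Po}(n)$.

Let us now show that if $f$ is bounded and continuous, and if
$W \sim \mathrm{Po}(1)$,

$$E_{F_n}(f) = \frac{1}{n}\sum_{i=1}^n f(w_i) \to E(f(W)) \quad \text{in probability.}$$

To do so, we note that $w_i \sim \text{binomial}(n, 1/n)$ and therefore its
marginal distribution is asymptotically $\mathrm{Po}(1)$. Therefore,

$$E(E_{F_n}(f)) \to E(f(W)).$$

Now all we need to do is therefore to show that $\mathrm{var}(E_{F_n}(f))$ goes
to zero. Clearly, by independence of the $\pi_i$'s,

$$\mathrm{var}\Bigl(\frac{1}{n}\sum_{i=1}^n f(\pi_i)\Bigr) = \frac{1}{n}\,\mathrm{var}(f(W)) = O\Bigl(\frac{1}{n}\Bigr),$$

because $f$ is bounded. But our first remark implies that

$$\mathrm{var}(E_{F_n}(f)) = \mathrm{var}\Bigl(\frac{1}{n}\sum_{i=1}^n f(\pi_i) \Bigm| \Pi_n = n\Bigr).$$

Now,

$$\begin{aligned}
\mathrm{var}\Bigl(\frac{1}{n}\sum_{i=1}^n f(\pi_i)\Bigr)
&= E\Bigl(\mathrm{var}\Bigl(\frac{1}{n}\sum_{i=1}^n f(\pi_i) \Bigm| \Pi_n\Bigr)\Bigr) + \mathrm{var}\Bigl(E\Bigl(\frac{1}{n}\sum_{i=1}^n f(\pi_i) \Bigm| \Pi_n\Bigr)\Bigr) \\
&\geq \mathrm{var}\Bigl(\frac{1}{n}\sum_{i=1}^n f(\pi_i) \Bigm| \Pi_n = n\Bigr) P(\Pi_n = n).
\end{aligned}$$

Since $\Pi_n$ has $\mathrm{Po}(n)$ distribution,
$P(\Pi_n = n) \sim 1/\sqrt{2\pi n}$. Hence,

$$\mathrm{var}(E_{F_n}(f)) = \mathrm{var}\Bigl(\frac{1}{n}\sum_{i=1}^n f(\pi_i) \Bigm| \Pi_n = n\Bigr) = O(n^{-1/2}) \to 0,$$

and the result is established. □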
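
Proposition 4.10 is easy to visualize by simulation. The following minimal
sketch (with assumed sample size and hypothetical helper names
`weight_frequencies` and `poisson1_pmf`) draws one vector of bootstrap weights
and compares the empirical frequencies of the values $0, 1, \ldots, 5$ with the
$\mathrm{Po}(1)$ probabilities $e^{-1}/k!$.

```python
import numpy as np
from math import exp, factorial

def weight_frequencies(n, rng, kmax=5):
    """Empirical frequencies of the values 0..kmax among multinomial bootstrap weights."""
    w = rng.multinomial(n, np.full(n, 1.0 / n))   # w ~ multinomial(1/n, ..., 1/n, n)
    return np.array([(w == k).mean() for k in range(kmax + 1)])

def poisson1_pmf(kmax=5):
    """Po(1) probabilities P(W = k) = e^{-1} / k! for k = 0..kmax."""
    return np.array([exp(-1.0) / factorial(k) for k in range(kmax + 1)])

rng = np.random.default_rng(0)
freq = weight_frequencies(20000, rng)
for k, (f_k, p_k) in enumerate(zip(freq, poisson1_pmf())):
    print(f"k = {k}: empirical = {f_k:.4f}, Po(1) = {p_k:.4f}")
```

In particular, the fraction of observations that do not appear in the bootstrap
sample should be close to $e^{-1}$.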
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We will also need later to use on the following (coarse) fact: 

Fact 4.11. Let the vector w be distributed according to a multino- 
mial^, . . . , -, n) distribution. Then 

P( max Wi> (log n)) =o( — -Y 
\i=l,...,n J \ (logn)! / 

In particular, this probability goes to faster than any n~ a , a > 0. 

The proof of the fact is elementary, and relies on the representation used 
above for the vector w, a simple union bound, the fact that P(Po(n) = n) ~ 
n- 1 ' 2 and the fact that P(Po(l) > M) < {M\)~ l M/(M - 1) which is easy 
to see by writing explicitly the probability we are trying to compute. 
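Proposition 4.10 and Fact 4.11 are easy to check numerically. The following is a minimal sketch, not part of the original argument: it draws one vector of bootstrap weights, compares the fraction of weights equal to $k$ with the $\mathrm{Po}(1)$ probabilities, and reports the largest weight next to $\log n$. The function names, the sample size and the seed are illustrative assumptions.

```python
import math
import numpy as np

def bootstrap_weights(n, seed=0):
    """Draw one vector of nonparametric-bootstrap weights: multinomial(n; 1/n, ..., 1/n)."""
    rng = np.random.default_rng(seed)
    return rng.multinomial(n, np.full(n, 1.0 / n))

def weight_frequencies(w, max_k=5):
    """Fraction of weights equal to 0, 1, ..., max_k (the empirical distribution F_n)."""
    return np.bincount(w, minlength=max_k + 1)[: max_k + 1] / w.size

def poisson1_pmf(max_k=5):
    """P(Po(1) = k) for k = 0, ..., max_k."""
    return np.array([math.exp(-1.0) / math.factorial(k) for k in range(max_k + 1)])

n = 200_000  # illustrative sample size
w = bootstrap_weights(n)
freq = weight_frequencies(w)
pmf = poisson1_pmf()
for k in range(6):
    print(f"k={k}: empirical {freq[k]:.4f}   Po(1) {pmf[k]:.4f}")
print("largest weight:", w.max(), "  log n:", round(math.log(n), 2))
```

In particular, roughly a fraction $1/e$ of the observations receive weight 0, which is the fact used below when counting the number of nonzero bootstrap weights.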

With these preliminaries behind us, we are now ready to tackle the ques- 
tion of understanding the (first-order) bootstrap properties of the statis- 
tics appearing in the study of quadratic programs with linear equality con- 
straints. 



4.4.3. On inverse covariance matrices computed from bootstrapped data.
Our aim in this subsubsection and the next is to find analogs of Theorems 4.1
and 4.6. Our first result along these lines is an analog of Theorem 4.1.

We present the result in the case of Gaussian data, where we can get a 
somewhat explicit expression for the quantity we care about, and discuss 
possible extensions below. 

Theorem 4.12. Suppose we observe $n$ i.i.d. observations $X_i$ in $\mathbb{R}^p$,
with distribution $\mathcal{N}(\mu, \Sigma_p)$. Call $\rho_n = p/n$ and assume that
$\rho_n \to \rho \in (0, 1 - e^{-1})$. Call $\hat\Sigma^*$ the covariance matrix computed
after bootstrapping the $X_i$'s. Call $\mathcal{P}_n$ the joint distribution of the $X_i$'s.

If $v$ is a (sequence of) deterministic vectors, then conditional on
$\{X_i\}_{i=1}^n$, with high $\mathcal{P}_n$-probability,

\[ \frac{v'(\hat\Sigma^*)^{-1}v}{v'\Sigma^{-1}v} \to s \quad \text{in probability}, \]

where $s$ satisfies, if $G$ is a $\mathrm{Po}(1)$ distribution and $\tau \sim G$,

\[ \mathbf{E}_G\left(\frac{1}{1 + \rho\tau s}\right) = 1 - \rho. \tag{11} \]

Proof. As before, we call $\mathcal{Q}_n$ the law of the bootstrap weights [i.e.,
multinomial$(1/n, \ldots, 1/n, n)$] and $\mathcal{R}_n = \mathcal{Q}_n \times \mathcal{P}_n$.
Without loss of generality, we can assume that $\mu = 0$. Let us call $D$ the
diagonal matrix containing the bootstrap weights. We have $\hat\mu^* = X'De/n$.
Also, it is true that

\[ \hat\Sigma^* = \frac{1}{n-1}\left(X - \frac{1}{n}ee'DX\right)' D \left(X - \frac{1}{n}ee'DX\right). \]

Since $e'De = n$, we also have

\[ (n-1)\hat\Sigma^* = X'D\left(\mathrm{Id}_n - \frac{1}{n}ee'D\right)X
 = X'D^{1/2}\left(\mathrm{Id}_n - \frac{1}{n}D^{1/2}ee'D^{1/2}\right)D^{1/2}X. \]

Because $X$ is of the form $X = Y\Sigma^{1/2}$ under our assumptions, we see that

\[ \hat\Sigma^* = \Sigma^{1/2} S^* \Sigma^{1/2}, \quad \text{where} \quad
 S^* = \frac{1}{n-1} Y'D^{1/2}\left(\mathrm{Id}_n - \frac{1}{n}D^{1/2}ee'D^{1/2}\right)D^{1/2}Y. \]

If we call $\delta = D^{1/2}e$, we have $\|\delta\|_2^2 = n$ because the sum of the
bootstrap weights is $n$. Therefore, $H_\delta = \mathrm{Id}_n - \delta\delta'/n \succeq 0$.
Also, $H_\delta$ (like $H$) is a projection matrix and a rank 1 perturbation of
$\mathrm{Id}_n$.

The situation is therefore very similar to the question we studied in
Theorem 4.1, except that $H = \mathrm{Id}_n - ee'/n$ is replaced by
$H_\delta = \mathrm{Id}_n - \delta\delta'/n$. All the arguments given there hold provided
we can show that (Assumption-BB) is satisfied for the bootstrap weights in the
situation we have here.

Now let us call $N$ the number of nonzero bootstrap weights. In the notation
of Theorem 4.1, $\lambda_i = \sqrt{d_i}$ and $\tau_i = d_i$. So clearly,
$\tau_{(N)} \geq 1$. So $C_0 = 1/2$ is a possibility. Also, $N/n \to 1 - 1/e$ in
probability, so $p/N$ has a limit in probability and this limit is bounded away
from 1 because of our assumption that $\rho_n \to \rho \in (0, 1 - 1/e)$. Finally, we
can pick $\eta_0 = (1 - 1/e)/2$.

So the proof of Theorem 4.1 applies [it is easy to see here that the
assumption that $\tau_i \neq 0$ can be dispensed with, because we know that the
nonzero $\tau_i$'s are large enough for our arguments to go through, and there
are enough of them that we do not have problems (at least in probability) with
$S^{-1}$ not being defined], and we have the announced result. □
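To get a feel for the size of $s$ in Theorem 4.12, one can solve equation (11) numerically when $G$ is $\mathrm{Po}(1)$. The sketch below is an illustration under stated assumptions, not the paper's procedure: it truncates the Poisson sum at `kmax` terms and uses plain bisection; `g`, `solve_s`, `kmax` and `tol` are hypothetical names. The left-hand side of (11) is decreasing in $s$, equals 1 at $s = 0$ and tends to $P(\tau = 0) = 1/e$ as $s \to \infty$, which is why a root exists exactly when $\rho < 1 - 1/e$.

```python
import math

def g(x, rho, kmax=60):
    """E[1 / (1 + rho * tau * x)] for tau ~ Po(1), with the series truncated at kmax."""
    return sum(
        (math.exp(-1.0) / math.factorial(k)) / (1.0 + rho * k * x)
        for k in range(kmax + 1)
    )

def solve_s(rho, tol=1e-12):
    """Solve g(s, rho) = 1 - rho by bisection; requires 0 < rho < 1 - 1/e."""
    if not 0.0 < rho < 1.0 - math.exp(-1.0):
        raise ValueError("need 0 < rho < 1 - 1/e for a solution to exist")
    target = 1.0 - rho
    lo, hi = 0.0, 1.0
    while g(hi, rho) > target:      # g is decreasing: enlarge until the root is bracketed
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if g(mid, rho) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rho = 0.3
s_boot = solve_s(rho)
s_gauss = 1.0 / (1.0 - rho)         # value of s for Gaussian data without bootstrapping
print(f"rho={rho}: bootstrap s = {s_boot:.4f}, Gaussian 1/(1-rho) = {s_gauss:.4f}")
```

By Jensen's inequality the bootstrap value is always larger than $1/(1-\rho)$, so the bootstrapped statistic is inflated by a different factor than the one that links the sample statistic to its population counterpart.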

The previous theorem settled the question of understanding the impact of
the nonparametric bootstrap on statistics of the form $v'\hat\Sigma^{-1}v$ in the
situation where the original data were Gaussian. A similar analysis could be
carried out in the case of elliptical data, when we assume that the
"ellipticity" parameters, $\lambda_i$, are such that (Assumption-BB) is satisfied
for the "new weights" $\tau_i = \lambda_i^2 w_i$. The result would then depend on the
limiting distribution of $\lambda_i^2 w_i$ (if it exists), where $w_i$ is the bootstrap
weight given to observation $i$.

4.4.4. Bootstrap analogs of Theorems 4.5 and 4.6. An important piece
of our analysis of quadratic programs with linear equality constraints when
the data are elliptically distributed was the study of quadratic forms of the
type $\hat\mu'\hat\Sigma^{-1}\hat\mu$. It is natural to ask what happens to them when we
bootstrap the data. In the elliptical case, we saw that the key statistic was
of the form, when $\mu = 0$ and $\hat\Sigma = \Sigma^{1/2}Y'\Lambda^2Y\Sigma^{1/2}/n$,

\[ \hat\mu'\hat\Sigma^{-1}\hat\mu = \frac{1}{n} e'\Lambda Y(Y'\Lambda^2 Y)^{-1}Y'\Lambda e. \]

However, in the bootstrap case, if $\Lambda$ is the diagonal matrix containing the
bootstrap weights, we have $\hat\Sigma^* = \Sigma^{1/2}Y'\Lambda Y\Sigma^{1/2}/n$ but
$\hat\mu^* = \Sigma^{1/2}Y'\Lambda e/n$, so the key statistic is going to be of the form

\[ (\hat\mu^*)'(\hat\Sigma^*)^{-1}(\hat\mu^*) = \frac{1}{n} e'\Lambda Y(Y'\Lambda Y)^{-1}Y'\Lambda e. \]

This creates complications because the matrix $\Lambda Y(Y'\Lambda Y)^{-1}Y'\Lambda$ is not a
projection matrix, and hence some of our previous analysis cannot be applied
directly. However, this statistic can be rewritten, if we denote
$w = \Lambda^{1/2}e$, as

\[ \frac{1}{n} w'\Lambda^{1/2}Y(Y'\Lambda Y)^{-1}Y'\Lambda^{1/2}w = \frac{1}{n} w'P_{\Lambda^{1/2}}w, \]

where $P_{\Lambda^{1/2}}$ is now a projection matrix. As before, its off-diagonal
elements have mean 0 (conditional on $\Lambda$), but now we also need to understand
$\sum_{i=1}^n w_iP_{i,i}/n$ and not only $\sum_{i=1}^n P_{i,i}/n$. A detailed analysis
of the former quantity is done in Appendix C.3.

We naturally now assume that $p/n$ has a finite limit, $\rho$ in $(0, 1 - 1/e)$.
As explained in Appendix C.3, $\sum_{i=1}^n w_iP_{i,i}/n \to (s-1)/s$ in
$\mathcal{Q}_n$-probability, with $\mathcal{P}_n$-probability going to 1, where $s$ is
computed by solving equation (11) [i.e., using $\mathrm{Po}(1)$ for $G$ in that
equation].

Similarly, it is explained there that, with $\mathcal{P}_n$-probability going to 1,
when the $X_i$ have mean 0,

\[ (\hat\mu^*)'(\hat\Sigma^*)^{-1}\hat\mu^* \to s - 1 \quad \text{in } \mathcal{Q}_n\text{-probability}. \]

Finally, an analog of Theorem 4.5 holds, so we have an analog of
Theorem 4.6, where $s$ is as defined above, and $\rho_n/(1-\rho_n)$ needs to be
replaced by $s - 1$.

In summary, we have the following proposition. 

Proposition 4.13. Call $s$ the quantity defined by equation (11).

Suppose the data $X_1, \ldots, X_n$ is i.i.d. $\mathcal{N}(\mu, \Sigma)$, and call
$\mathcal{P}_n$ the corresponding probability distribution. Suppose $v$ is a given
deterministic sequence of vectors. Under the assumptions of Theorem 4.12, we
have, when bootstrapping the data, with $\mathcal{P}_n$-probability going to 1,

\[ \frac{v'(\hat\Sigma^*)^{-1}v}{v'\Sigma^{-1}v} \to s \quad \text{in } \mathcal{Q}_n\text{-probability}, \]

\[ (\hat\mu^*)'(\hat\Sigma^*)^{-1}\hat\mu^* \to s - 1 \quad \text{in } \mathcal{Q}_n\text{-probability, when } \mu = 0, \]

\[ (\hat\mu^*)'(\hat\Sigma^*)^{-1}\hat\mu^* = s\,\mu'\Sigma^{-1}\mu + (s - 1) + o_{\mathcal{Q}_n}\left(\sqrt{\mu'\Sigma^{-1}\mu},\, 1\right). \]

We note that our techniques could yield generalizations of the previous
fact for the case where the data is elliptically distributed. However, in the
case where the $X_i$ have mean 0, the quantity $(\hat\mu^*)'(\hat\Sigma^*)^{-1}\hat\mu^*$ does
not seem to have a limiting value that is writable in compact form, so we do
not dwell on this question further.

Naturally, the motivation behind the previous proposition is practical 
and the results are interesting from that standpoint. They show that the 
bootstrap yields inconsistent estimators of the population quantities, some- 
thing that is not completely unexpected when we understand the random 
matrix aspects of these questions. Perhaps even more interesting is that 
bootstrap estimates of bias are themselves inconsistent: as a matter of fact, 
the key quantity that measures bias in the Gaussian case is 1/(1 — p/n); 
when bootstrapping it is replaced by s, as defined in equation (11). These 
results therefore cast some doubts on the practical relevance of the boot- 
strap for the high-dimensional problems we are considering, at least when 
the bootstrap is used in "classical" ways. 

4.4.5. On the parametric bootstrap. In the settings considered here, it is
also natural to ask how the parametric bootstrap would behave. For instance,
if we assumed Gaussianity of the data, we could just estimate $\Sigma$ and $\mu$
(e.g., naively, by $\hat\Sigma$ and $\hat\mu$) and use a parametric bootstrap to get at
the quantities we are interested in.

Naturally, the analysis of such a scheme is similar to the analysis of the
Gaussian case carried out in Section 3, where the population parameters
$\Sigma$ and $\mu$ need to be replaced by the estimators we use in our parametric
bootstrap. The same would be true if we were to do a parametric bootstrap
for elliptical data, but we would have to use the results of Section 4 instead.

Our computations show that the parametric bootstrap could be used in
the problems under study to estimate the bias of various plug-in estimators:
we would for instance recover the correct $s$ by considering
$v'(\hat\Sigma^*_{\mathrm{parametric}})^{-1}v / v'\hat\Sigma^{-1}v$. We note, however, that our
analyses, and the estimation work we carry out in Section 5, could do this
too, at a cheaper numerical cost.

Finally and very interestingly, we see that a naive use of the parametric 
bootstrap to estimate the bias in the empirical efficient frontier — a perhaps 
reasonable idea at first glance — would yield inconsistent estimates of bias. 

5. Robustness, bias and improved estimation. We now go back to our 
original problem, which was to understand the relationship between the so- 
lution of problem (QP-eqc-Emp) and the solution of problem (QP-eqc-Pop) 
(see page 10 for definitions). 

It is naturally important to understand the effect of assuming that the data
is normally distributed as compared to, say, an assumption of elliptical
distribution for the data. The following discussion fleshes out some of our
theoretical results and their significance when solving quadratic programs
with linear constraints. The discussion is an application of the work done in
Sections 2-4. It might appear to be mainly heuristic, but precise statements
can be easily deduced from the precise statements of the theorems given in
the corresponding technical sections.

We discuss here only the case of i.i.d. data. As we have shown above, the 
bootstrap case and the case of correlated observations are more complicated 
to handle, and the formulas are not as explicit in those cases as they are in 
the case of i.i.d. data. But for certain cases, one could plug-in our earlier 
results for those situations to obtain explicit results about efficient frontiers 
and weight vectors in those cases too. 

As a matter of notation, all of our approximation statements hold with
high probability asymptotically, unless otherwise noted. We will carry out
our work under the model put forward in Theorem 4.1, assuming that the
$\lambda_i$'s are i.i.d. and the following assumptions:

Assumption A1: for all $i \in \{1, \ldots, k\}$, $v_i'\Sigma^{-1}v_i$ stays bounded away
    from 0. $v_k$ is assumed to be equal to $\mu$.
Assumption A2: the smallest eigenvalue of $M = V'\Sigma^{-1}V$ stays bounded
    away from 0 and the condition number of $M$ remains bounded.
Assumption A3: if $\varepsilon = \pm 1$, $(v_i + \varepsilon v_j)'\Sigma^{-1}(v_i + \varepsilon v_j)$ stays
    bounded away from infinity.
Assumption A4: (Assumption-BB) and (Assumption-BL) hold. (See
    Theorem 4.6 for definitions.)
Assumption A5: we have, for some $\varepsilon > 0$, if
    $u_n = (2\log(n) + (\log n)^{\varepsilon})^{1/2} + \sqrt{2\pi}$,
    $u_n\sqrt{|||\Sigma|||_2/\mathrm{trace}(\Sigma)} \to 0$, where $|||\Sigma|||_2$ is the largest
    eigenvalue of $\Sigma$.

These assumptions guarantee that the noise terms involving $\hat\mu$ do not
overwhelm the signal terms involving $\mu$, and also that we can safely take
inverses of our approximations to get approximations of their inverses. Also,
all the key results we obtained in Sections 3 and 4 are applicable, and our
conclusions will of course heavily rely on them.

We will use the notation $\rho_n = p/n$. We recall that in the Gaussian case,
the quantity $s$ appearing below is approximately equal to $1/(1-\rho_n)$ and in
the elliptical case, it is always greater than $1/(1-\rho_n)$, as we explained
after the proof of Theorem 4.1.

We start by investigating the case of equality constraints. We discuss 
inequality constraints in Section 5.6. 

5.1. Relative positions of efficient frontiers: Gaussian vs. elliptical case.
When assumptions (A1-A4) hold, it is clear that

\[ \hat M = \hat V'\hat\Sigma^{-1}\hat V \simeq s\,V'\Sigma^{-1}V + \frac{\rho_n}{1-\rho_n}\,e_ke_k'. \tag{12} \]

Now recall that in the elliptical case, $s \geq 1/(1-p/n) = s_G$, that is, the "$s$"
corresponding to the Gaussian case. Calling $\hat M_E$ the empirical estimator of
$M$ we get in the elliptical case and $\hat M_G$ its analog in the Gaussian case, we
have, when A1-A4 are satisfied, with high probability,

\[ \hat M_E \succeq \hat M_G, \]

at least asymptotically.

We now call $f_{\mathrm{emp}}^{(E)}$ and $f_{\mathrm{emp}}^{(G)}$ the "efficient frontiers"
obtained by solving problem (QP-eqc-Emp) when the data is respectively
elliptical and Gaussian. Recall that under our assumptions, $\mu$ and $\Sigma$ are the
same for the two problems, so the population version corresponding to the two
problems is the same. We call the population solution, that is, the efficient
frontier computed with the population parameters, $f_{\mathrm{theo}}$. Naturally, this
is the quantity we are fundamentally interested in estimating.

Using the fact that $f_{\mathrm{emp}} = U'\hat M^{-1}U$, we have the following
important results.

Theorem 5.1. When assumptions A1-A4 are satisfied, we have with
high probability and asymptotically,

\[ f_{\mathrm{emp}}^{(E)} \leq f_{\mathrm{emp}}^{(G)} \leq f_{\mathrm{theo}}. \]

In other words, risk underestimation in the empirical quadratic program with
linear equality constraints is least severe (within the class of elliptical
models) in the Gaussian case.

In other respects, we have, asymptotically, with high probability, if
$\kappa_n = \rho_n/(1-\rho_n)$,

\[ f_{\mathrm{emp}} \simeq \frac{1}{s}\left(f_{\mathrm{theo}} - \frac{\kappa_n}{s}\,\frac{(e_k'M^{-1}U)^2}{1 + (\kappa_n/s)\,e_k'M^{-1}e_k}\right). \tag{13} \]

Another way of phrasing this result is the fact that the Gaussian analysis 
gives the most optimistic view of risk underestimation within the class of 
elliptical models considered here. 

Practically, it means that users of Markowitz-type optimization should 
be wary of the empirical solution they get, and even of the correction that 
Gaussian results suggest. If the data is elliptical, they will underestimate 
the risk of their portfolio even more than the Gaussian results suggest. 

Let us now give a proof of Theorem 5.1. 

Proof of Theorem 5.1. Under the assumptions of the theorem, we
can use the approximation in equation (12). The first part of the theorem
has been argued before, so we do not need to do anything else to obtain it.

The second part follows directly from a rank one perturbation argument.
We have

\[ f_{\mathrm{emp}} = U'\hat M^{-1}U \simeq \frac{1}{s}\,U'\left(M + \frac{\kappa_n}{s}e_ke_k'\right)^{-1}U. \]

Using the classic result $(M + vv')^{-1} = M^{-1} - M^{-1}vv'M^{-1}/(1 + v'M^{-1}v)$,
we conclude that

\[ U'\left(M + \frac{\kappa_n}{s}e_ke_k'\right)^{-1}U = U'M^{-1}U - \frac{\kappa_n}{s}\,\frac{(U'M^{-1}e_k)^2}{1 + (\kappa_n/s)\,e_k'M^{-1}e_k}. \]

We now recall from Section 2 that $f_{\mathrm{theo}} = U'M^{-1}U$, and we have the
announced result. □
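The rank one step in this proof can be verified numerically. The sketch below is purely illustrative: the matrix `M`, the vector `U` and the constant `c` (playing the role of $\kappa_n/s$) are arbitrary test values, and `frontier_rank_one` is a hypothetical helper name. It compares a direct solve against the rank one update formula.

```python
import numpy as np

def frontier_rank_one(M, U, c):
    """U'(M + c e_k e_k')^{-1} U computed with the rank one update formula.

    e_k is the last canonical basis vector; M is assumed symmetric positive definite.
    """
    k = M.shape[0]
    e_k = np.zeros(k)
    e_k[-1] = 1.0
    Minv_U = np.linalg.solve(M, U)
    Minv_e = np.linalg.solve(M, e_k)
    return U @ Minv_U - c * (e_k @ Minv_U) ** 2 / (1.0 + c * (e_k @ Minv_e))

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
M = A @ A.T + 4.0 * np.eye(4)      # arbitrary symmetric positive definite test matrix
U = rng.standard_normal(4)
c = 0.35                           # stands in for kappa_n / s
e_k = np.eye(4)[:, -1]

direct = U @ np.linalg.solve(M + c * np.outer(e_k, e_k), U)
via_formula = frontier_rank_one(M, U, c)
print(direct, via_formula)
```

Since the correction term is nonnegative, the perturbed quadratic form never exceeds $U'M^{-1}U$, which is the algebraic source of the risk underestimation in equation (13).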

Equation (13) naturally suggests better ways of estimating $f_{\mathrm{theo}}$ than
using $f_{\mathrm{emp}}$. We postpone a discussion of this issue to Section 5.4 because
it requires somewhat lengthy preliminaries.

5.2. Issues concerning the weights of the portfolio. Besides problems in
the location of the efficient frontiers, our analysis reveals another very
interesting phenomenon: problems with estimating $w_{\mathrm{theo}}$, the optimal vector
of weights. In particular, one can show that the mean return of the portfolio
is poorly estimated and the weight given to each asset is biased.

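Theorem 5.2 (Bias in weights). Suppose assumptions A1-A4 hold. We
have, asymptotically and with high probability,

\[ w_{\mathrm{emp}} \simeq w_{\mathrm{theo}} - \zeta(s)\,\frac{\kappa_n}{s}\,w_b, \tag{14} \]

where

\[ w_b = \Sigma^{-1}VM^{-1}e_k \quad \text{and} \quad \zeta(s) = \frac{e_k'M^{-1}U}{1 + (\kappa_n/s)\,e_k'M^{-1}e_k}. \]

This approximation is valid when looking at linear combinations of the vector
of weights: if $\gamma \in \mathbb{R}^p$ is deterministic and assumption A3 extended to
include this vector holds,

\[ \gamma'w_{\mathrm{emp}} = \gamma'\left(w_{\mathrm{theo}} - \zeta(s)\,\frac{\kappa_n}{s}\,w_b\right) + o_P(1). \]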



We note that the last assertion of the theorem does not necessarily im- 
mediately follow from equation (14) in high-dimension, but it is true in 
the setting we consider. A particularly interesting corollary is the following 
statement concerning inconsistent estimation of the returns. 

Corollary 5.3 (Poor estimation of returns). Recall that with our
notations, $w_{\mathrm{theo}}'\mu = u_k = \mu_P$. In practical terms, $\mu_P$ corresponds to
the desired expected returns we wish to have for our "portfolio." Under the
same assumptions as that of Theorem 5.2, we have

\[ \mu'w_{\mathrm{emp}} \simeq \frac{\mu_P}{1 + (\kappa_n/s)\,e_k'M^{-1}e_k} - \frac{\kappa_n}{s}\,\frac{\sum_{i<k} u_i\,e_k'M^{-1}e_i}{1 + (\kappa_n/s)\,e_k'M^{-1}e_k}. \]

The previous corollary is a statement about poor estimation of returns
for the following reason: $\hat\mu'w_{\mathrm{emp}} = \mu_P$ by construction, so one might
naively hope that, for a new observation $X_{n+1}$, independent of
$X_1, \ldots, X_n$ and with the same distribution as them,
$\mathbf{E}(w_{\mathrm{emp}}'X_{n+1} \mid X_1, \ldots, X_n) = w_{\mathrm{emp}}'\mu \simeq \mu_P$.
However, as the previous corollary shows, this is not satisfied. We note that
the factor affecting $\mu_P$ is a shrinkage factor, always smaller than 1 because
$M$ is positive semi-definite. The other term could have either sign, so its
effect on return estimation is less interpretable. For large $\mu_P$, it is
nonetheless clear that the previous corollary shows that the returns are
overestimated: the realized returns are (asymptotically and with high
probability) less than $\mu_P$. Hence, our result can be seen as a generalization
of the overestimation of returns result first found in [Jobson and Korkie
(1980)], in the low-dimensional Gaussian case.

We now prove these two results. The proof of the corollary is at the end 
of the proof of the theorem. 

Proof of Theorem 5.2. Under the assumptions of the theorem we
have

\[ \hat M \simeq s\left(M + \frac{\kappa_n}{s}e_ke_k'\right), \]

and our assumptions guarantee that we can take inverses and still have valid
approximations. Hence, using the classic formula for inversion of a rank one
perturbation of a matrix [see Horn and Johnson (1990), page 19], we have

\[ \hat M^{-1} \simeq \frac{1}{s}\left(M^{-1} - \frac{\kappa_n}{s}\,\frac{M^{-1}e_ke_k'M^{-1}}{1 + (\kappa_n/s)\,e_k'M^{-1}e_k}\right). \]

Now recall that $w_{\mathrm{emp}} = \hat\Sigma^{-1}\hat V\hat M^{-1}U$ and
$w_{\mathrm{theo}} = \Sigma^{-1}VM^{-1}U$. For a deterministic $\gamma$, our work in Section 4
indicates that $\gamma'\hat\Sigma^{-1}\hat V = s\,\gamma'\Sigma^{-1}V + o_P(1)$. So we conclude that

\[ \gamma'w_{\mathrm{emp}} = \gamma'\Sigma^{-1}V\left(M^{-1} - \frac{\kappa_n}{s}\,\frac{M^{-1}e_ke_k'M^{-1}}{1 + (\kappa_n/s)\,e_k'M^{-1}e_k}\right)U + o_P(1). \]

In other words, we have

\[ \gamma'w_{\mathrm{emp}} = \gamma'\Sigma^{-1}VM^{-1}U - \frac{\kappa_n}{s}\,\gamma'\Sigma^{-1}VM^{-1}e_k\,\frac{e_k'M^{-1}U}{1 + (\kappa_n/s)\,e_k'M^{-1}e_k} + o_P(1), \]

or, as announced,

\[ \gamma'w_{\mathrm{emp}} = \gamma'w_{\mathrm{theo}} - \frac{\kappa_n}{s}\,\gamma'w_b\,\zeta(s) + o_P(1). \]

It seems difficult to say more, because $w_b$ and $\zeta$ are population parameters
and their properties and values may vary from problem to problem.

• Proof of the corollary.

We now assume that $\gamma = \mu$. We remark that $\mu = Ve_k$, by construction
of $V$. Therefore,

\[ \mu'w_b = e_k'V'\Sigma^{-1}VM^{-1}e_k = e_k'MM^{-1}e_k = 1. \]

Further,

\[ e_k'M^{-1}U = \sum_{i=1}^k u_i\,e_k'M^{-1}e_i = \sum_{i<k} u_i\,e_k'M^{-1}e_i + \mu_P\,e_k'M^{-1}e_k. \]

These two remarks and the result of Theorem 5.2 give the conclusion of the
corollary. □

5.3. Bias correction for the weights. An important question now that we 
have identified possible problems with the empirical weights is to try and 
correct them. We propose such a scheme, suggested by our computations. 

Our investigations will rely on the following asymptotic result, discussed
in Theorem 5.2: in the notations of this theorem,

\[ \gamma'w_{\mathrm{emp}} = \gamma'w_{\mathrm{theo}} - \frac{\kappa_n}{s}\,\gamma'w_b\,\zeta(s) + o_P(1). \]

Our efforts will focus on trying to estimate $w_b$ and $\zeta(s)/s$, as
$\kappa_n = \rho_n/(1-\rho_n)$ is known and computable from the data.

Recall that we assumed that $v_k = \mu$ and let us call

\[ \tilde M = \hat M - \kappa_n e_ke_k'. \]

Under the assumptions underlying the previous computations, we have

\[ \tilde M \simeq sM. \]

In practice, we wish $\tilde M$ to be a positive semi-definite matrix, something that
is guaranteed asymptotically, but might require checking and potentially
corrections in practice.
We propose to use:

1. As an estimator of $w_b$,

\[ \hat w_b = \hat\Sigma^{-1}\hat V\tilde M^{-1}e_k. \]

2. As an estimator of $\zeta(s)/s$,

\[ z = \frac{e_k'\tilde M^{-1}U}{1 + \kappa_n\,e_k'\tilde M^{-1}e_k}. \]

For any deterministic $\gamma$ (such that the assumptions of Theorem 5.2 hold),
$\gamma'\hat w_b \simeq \gamma'w_b$, because $\gamma'\hat\Sigma^{-1}\hat V \simeq s\,\gamma'\Sigma^{-1}V$ and
$\tilde M^{-1} \simeq M^{-1}/s$. Also, $e_k'\tilde M^{-1}U \simeq s^{-1}e_k'M^{-1}U$, and
$e_k'\tilde M^{-1}e_k \simeq s^{-1}e_k'M^{-1}e_k$, so $z \simeq \zeta(s)/s$. Hence,

\[ \gamma'(w_{\mathrm{emp}} + \kappa_n z\,\hat w_b) \simeq \gamma'w_{\mathrm{theo}}. \]

In other words, we have found an asymptotically consistent way of estimating
the quantities of interest. Hence, the estimator we propose to use is

\[ \hat w_{\mathrm{theo}} = w_{\mathrm{emp}} + \kappa_n z\,\hat w_b = \hat\Sigma^{-1}\hat V\tilde M^{-1}U. \tag{15} \]

Interestingly, this proposal does not require us to estimate $s$. Furthermore,
because we have consistency of the estimator in the whole class of elliptical
distributions, this estimator is fairly robust to distributional assumptions
about the data. Finally, the estimator is consistent in the sense that all
(deterministic and given) linear combinations of $\hat w_{\mathrm{theo}}$ are consistent
for the corresponding linear combinations of $w_{\mathrm{theo}}$ (provided these linear
combinations are such that the assumptions of Theorem 5.2 apply to them).
(Naturally, we cannot take a supremum over too large a class of $\gamma$'s.)

The estimator satisfies the constraints. It is nonetheless natural to raise
the following question: does the proposed estimator satisfy the constraints
of the problem? If not, our proposal would be problematic, but it is indeed
the case that our estimator satisfies the constraints
$\hat w_{\mathrm{theo}}'v_i = u_i$ for all $i \in \{1, \ldots, k-1\}$. Naturally, the last
constraint (i.e., $\hat w_{\mathrm{theo}}'\mu = u_k = \mu_P$) is difficult to satisfy exactly
because $\mu$ is unknown, so it is also less of a concern.

Let us now briefly justify our claim concerning the satisfaction of the
equality constraints. By construction, $w_{\mathrm{emp}}$ satisfies the constraints
$w_{\mathrm{emp}}'v_i = u_i$, $1 \leq i \leq k-1$, so all we have to show is that the
$k \times 1$ vector $\hat V'\hat w_b$ is proportional to $e_k$. We recall that
$\tilde M = \hat M - \kappa_n e_ke_k'$, so

\[ \hat w_b = \hat\Sigma^{-1}\hat V(\hat M - \kappa_n e_ke_k')^{-1}e_k. \]

Using the standard formula for the inverse of a rank-1 perturbation of a
matrix, we therefore get

\[ \hat w_b = \hat\Sigma^{-1}\hat V\hat M^{-1}e_k + \kappa_n\,\hat\Sigma^{-1}\hat V\hat M^{-1}e_k\,\frac{e_k'\hat M^{-1}e_k}{1 - \kappa_n\,e_k'\hat M^{-1}e_k}
 = \frac{1}{1 - \kappa_n\,e_k'\hat M^{-1}e_k}\,\hat\Sigma^{-1}\hat V\hat M^{-1}e_k. \]

Once we recall that $\hat M = \hat V'\hat\Sigma^{-1}\hat V$, we immediately get the equality

\[ \hat V'\hat w_b = \frac{1}{1 - \kappa_n\,e_k'\hat M^{-1}e_k}\,e_k, \]

which shows that $v_i'\hat w_b = 0$ for $1 \leq i \leq k-1$, as announced.

Finally, from a practical point of view, one might be worried that the
estimator proposed in equation (15) "puts too much weight on the theory
and not enough on the data" and that better practical performance might
be achieved by tuning more finely our corrections to the data. For instance,
one might propose, we think reasonably, to use, instead of $\tilde M$, the matrix
$\tilde M(\lambda_1) = \hat M - \lambda_1\kappa_n e_ke_k'$, where $\lambda_1$ would be picked by some
form of cross-validation based on the new estimator
$\hat w_{\mathrm{theo}}(\lambda_1) = w_{\mathrm{emp}} + \kappa_n z(\lambda_1)\hat w_b(\lambda_1)$.
We do not discuss this issue any further in this paper as we plan to address
it in another, more applied, article. We do, however, show the performance
of our estimator in limited simulations in Section 5.5.
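The correction of Section 5.3 takes only a few lines to compute. The sketch below is an illustration under stated assumptions rather than the paper's code: `corrected_weights` is a hypothetical helper, the data are simulated Gaussian with identity covariance, there are $k = 2$ constraints (a budget constraint with $v_1$ the vector of ones, and a return constraint with $v_2$ replaced by the sample mean), and the targets in `U` are arbitrary. It computes the empirical weights, the two correction ingredients and the corrected weights of equation (15).

```python
import numpy as np

def corrected_weights(Sigma_hat, V_hat, U, kappa_n):
    """Return (w_emp, w_b_hat, z, w_corrected) as defined in Section 5.3.

    V_hat is p x k with the sample mean in its last column; U holds the k targets.
    """
    k = V_hat.shape[1]
    e_k = np.zeros(k)
    e_k[-1] = 1.0
    SinvV = np.linalg.solve(Sigma_hat, V_hat)            # Sigma_hat^{-1} V_hat
    M_hat = V_hat.T @ SinvV                              # V_hat' Sigma_hat^{-1} V_hat
    M_tilde = M_hat - kappa_n * np.outer(e_k, e_k)       # M_hat - kappa_n e_k e_k'
    w_emp = SinvV @ np.linalg.solve(M_hat, U)
    w_b_hat = SinvV @ np.linalg.solve(M_tilde, e_k)
    z = (e_k @ np.linalg.solve(M_tilde, U)) / (
        1.0 + kappa_n * (e_k @ np.linalg.solve(M_tilde, e_k))
    )
    w_corrected = SinvV @ np.linalg.solve(M_tilde, U)
    return w_emp, w_b_hat, z, w_corrected

rng = np.random.default_rng(0)
n, p = 200, 40                                           # illustrative sizes
mu = rng.uniform(0.1, 1.0, p)                            # hypothetical mean vector
X = mu + rng.standard_normal((n, p))                     # Gaussian data, Sigma = Id
Sigma_hat = np.cov(X, rowvar=False)
V_hat = np.column_stack([np.ones(p), X.mean(axis=0)])
U = np.array([1.0, 0.5])                                 # arbitrary targets u_1, mu_P
kappa_n = (p / n) / (1.0 - p / n)

w_emp, w_b_hat, z, w_corr = corrected_weights(Sigma_hat, V_hat, U, kappa_n)
print("budget constraint, empirical:", V_hat[:, 0] @ w_emp)
print("budget constraint, corrected:", V_hat[:, 0] @ w_corr)
```

The two expressions in equation (15) agree up to rounding, and the corrected weights still satisfy the first $k-1$ constraints, as argued above.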

5.4. Improved estimation of the frontier. We now discuss the question of 
improved estimation of the efficient frontier. This is naturally an important 
quantity in the problem, and, as we hope to have shown, a difficult one to 
estimate by naive methods. One aspect of its importance is that it gives us a 
benchmark of performance for optimal portfolios. We therefore think that in 
a financial context, it might be of great interest in particular to regulators. 

5.4.1. Estimation of s. Though we have seen that we could devise a
scheme to improve the estimation of the weights without having to estimate
$s$, this latter quantity is still an important one to estimate if we want
to better understand the pitfalls we might be facing.

In the elliptical case, where $X_i = \mu + \lambda_i\Sigma^{1/2}Y_i$, we wish to estimate
$\lambda_i^2$, as we have seen that $s$ is "driven" by this quantity. We now describe
heuristics that suggest how to estimate $s$; more detailed consistency
arguments follow in Proposition 5.4. To estimate $s$, we recall that standard
concentration of measure results (see below) say that with very high
probability, if the largest eigenvalue of $\Sigma$ does not grow too fast,

\[ \frac{\|\Sigma^{1/2}Y_i\|_2^2}{p} \simeq \frac{\mathrm{trace}(\Sigma)}{p}. \]

Hence, in this setting, the concentration of measure phenomenon can be
used for practical purposes. Now, note that
$\|\hat\mu - \mu\|_2^2 \simeq \mathrm{trace}(\Sigma)/n$, because under our assumptions A1-A4 and
the assumption of independence of the $\lambda_i$'s, $\sum_{i=1}^n \lambda_i^2/n \to 1$ and A5
implies that the previous approximation holds. Hence,

\[ \frac{\|X_i - \hat\mu\|_2^2}{p} \simeq \lambda_i^2\,\frac{\mathrm{trace}(\Sigma)}{p}. \]

We now propose the following estimator for $\lambda_i^2$:

\[ \hat\lambda_i^2 = \frac{\|X_i - \hat\mu\|_2^2}{\sum_{j=1}^n \|X_j - \hat\mu\|_2^2/n} \simeq \frac{\|X_i - \hat\mu\|_2^2}{\mathrm{trace}(\hat\Sigma)}. \]

If we denote $\rho_n = p/n$, we then propose to estimate $s$ using the positive
solution of

\[ \hat g_n(x) = 1 - \rho_n, \quad \text{where} \quad \hat g_n(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{1 + x\hat\lambda_i^2\rho_n}. \]

We note that this is just the discretized version of the equation
characterizing $s$. ($\hat g_n$ is clearly a continuous convex decreasing function of
$x$ on $[0, \infty)$, so the existence and uniqueness of a solution to
$\hat g_n(x) = 1 - \rho_n$ is clear.)
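The estimation recipe just described is straightforward to code. What follows is a minimal sketch, assuming the data sit in an $n \times p$ array with one observation per row; `estimate_lambda_sq` and `estimate_s` are hypothetical names, the root is found by plain bisection (any one-dimensional root finder would do), and the simulated elliptical example uses exponentially distributed $\lambda_i^2$ purely for illustration.

```python
import numpy as np

def estimate_lambda_sq(X):
    """Estimate lambda_i^2 by the squared norm of each centered row, normalized to average 1."""
    centered = X - X.mean(axis=0)
    sq_norms = np.sum(centered ** 2, axis=1)
    return sq_norms / sq_norms.mean()

def estimate_s(lambda_sq_hat, rho_n, tol=1e-10):
    """Positive solution of g_hat(x) = 1 - rho_n, found by bisection (g_hat is decreasing)."""
    if not 0.0 < rho_n < 1.0:
        raise ValueError("this sketch assumes 0 < p/n < 1")
    target = 1.0 - rho_n

    def g_hat(x):
        return np.mean(1.0 / (1.0 + x * lambda_sq_hat * rho_n))

    lo, hi = 0.0, 1.0
    while g_hat(hi) > target:       # enlarge the bracket until it contains the root
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if g_hat(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
n, p = 400, 100                      # illustrative sizes
rho_n = p / n
X_gauss = rng.standard_normal((n, p))                   # lambda_i = 1 for all i
lam = np.sqrt(rng.exponential(1.0, size=n))             # E(lambda_i^2) = 1, illustrative choice
X_ell = lam[:, None] * rng.standard_normal((n, p))      # elliptical data with Sigma = Id

s_gauss = estimate_s(estimate_lambda_sq(X_gauss), rho_n)
s_ell = estimate_s(estimate_lambda_sq(X_ell), rho_n)
print(f"Gaussian: s_hat = {s_gauss:.3f} (1/(1-rho_n) = {1/(1-rho_n):.3f}); elliptical: s_hat = {s_ell:.3f}")
```

Because the $\hat\lambda_i^2$ average to 1 by construction, Jensen's inequality guarantees that the estimate is never below $1/(1-\rho_n)$, in line with the ordering between the Gaussian and elliptical cases recalled earlier.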

5.4.2. Estimation of the efficient frontier. We recall an important result
from Theorem 5.1: under the assumptions made in this section,

\[ f_{\mathrm{emp}} \simeq \frac{1}{s}\left(f_{\mathrm{theo}} - \frac{\kappa_n}{s}\,\frac{(e_k'M^{-1}U)^2}{1 + (\kappa_n/s)\,e_k'M^{-1}e_k}\right). \]

Now recall that we have a consistent estimator of $e_k'M^{-1}/s$, that is,
$e_k'\tilde M^{-1}$, and we just discussed how to estimate $s$.

As an estimator of the efficient frontier we therefore propose

\[ \hat f_{\mathrm{theo}} = \hat s\left(f_{\mathrm{emp}} + \kappa_n\,\frac{(e_k'\tilde M^{-1}U)^2}{1 + \kappa_n\,e_k'\tilde M^{-1}e_k}\right). \]

We also note that $\tilde M$ could be replaced by $\tilde M(\lambda_1)$ described above with
a similar cross-validation scheme.
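As a small sketch of how the frontier estimator could be evaluated, the function below takes the $k \times k$ matrix $\hat M$, the targets $U$, $\kappa_n$ and an estimate of $s$ as inputs. The name `corrected_frontier` and the numerical values are illustrative assumptions. By the rank one update formula, the quantity in parentheses equals $U'\tilde M^{-1}U$, which the example uses as a cross-check.

```python
import numpy as np

def corrected_frontier(M_hat, U, kappa_n, s_hat):
    """s_hat * (f_emp + kappa_n (e_k' Mt^{-1} U)^2 / (1 + kappa_n e_k' Mt^{-1} e_k)),

    where Mt = M_hat - kappa_n e_k e_k' and f_emp = U' M_hat^{-1} U.
    """
    k = M_hat.shape[0]
    e_k = np.zeros(k)
    e_k[-1] = 1.0
    M_tilde = M_hat - kappa_n * np.outer(e_k, e_k)
    f_emp = U @ np.linalg.solve(M_hat, U)
    a = e_k @ np.linalg.solve(M_tilde, U)
    b = e_k @ np.linalg.solve(M_tilde, e_k)
    return s_hat * (f_emp + kappa_n * a ** 2 / (1.0 + kappa_n * b))

# Illustrative inputs (not taken from the paper).
M_hat = np.array([[40.0, 22.0], [22.0, 15.0]])
U = np.array([1.0, 0.5])
kappa_n = 0.25
s_hat = 4.0 / 3.0

f_hat = corrected_frontier(M_hat, U, kappa_n, s_hat)
M_tilde = M_hat - kappa_n * np.outer([0.0, 1.0], [0.0, 1.0])
cross_check = s_hat * (U @ np.linalg.solve(M_tilde, U))
print(f_hat, cross_check)
```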

5.4.3. Consistency of the estimator of s. Let us now show that our
proposed estimator of $s$ is consistent. We place ourselves in the setting where
the $\lambda_i$'s are i.i.d. with a second moment and $\mathbf{E}(\lambda_i^2) = 1$. Recall also
that the $Y_i$'s that appear below are such that $Y_i \sim \mathcal{N}(0, \mathrm{Id}_p)$.

We have the following proposition. 

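Proposition 5.4. Let us call
$u_n = (2\log(n) + (\log n)^{\varepsilon})^{1/2} + \sqrt{2\pi}$ and $|||\Sigma|||_2$ the largest
eigenvalue of $\Sigma$. Then we have, with probability going to 1,

\[ \max_{1 \leq i \leq n}\left|\frac{\|\Sigma^{1/2}Y_i\|_2^2}{p} - \frac{\mathrm{trace}(\Sigma)}{p}\right|
 \leq \frac{|||\Sigma|||_2}{p}(4 + u_n^2) + 2u_n\sqrt{\frac{|||\Sigma|||_2}{p}}\sqrt{\frac{\mathrm{trace}(\Sigma)}{p}}. \]

Further, if $\hat s_n$ is the solution of $\hat g_n(x) = 1 - \rho_n$,

\[ \hat s_n \to s \quad \text{in probability}, \]

as soon as, for some $\varepsilon > 0$, $u_n\sqrt{|||\Sigma|||_2/\mathrm{trace}(\Sigma)} \to 0$ as
$n \to \infty$.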
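The first claim of the proposition can be probed by simulation. The sketch below is an illustration only: it assumes a diagonal $\Sigma$ with eigenvalues spread evenly over $[0.5, 1.5]$, takes $\varepsilon = 0.5$, and compares the observed maximal deviation with the stated bound; `u_n` and `concentration_check` are hypothetical helper names.

```python
import numpy as np

def u_n(n, eps=0.5):
    """u_n = (2 log n + (log n)^eps)^{1/2} + sqrt(2 pi)."""
    return np.sqrt(2.0 * np.log(n) + np.log(n) ** eps) + np.sqrt(2.0 * np.pi)

def concentration_check(n, sigma_eigs, eps=0.5, seed=0):
    """Observed max_i | ||Sigma^{1/2} Y_i||^2 / p - trace(Sigma) / p | and the bound of Proposition 5.4.

    Sigma is taken diagonal with the given eigenvalues, so ||Sigma^{1/2} Y||^2 = sum_j sigma_j Y_j^2.
    """
    rng = np.random.default_rng(seed)
    p = sigma_eigs.size
    Y = rng.standard_normal((n, p))
    sq_norms = (Y ** 2) @ sigma_eigs
    trace = sigma_eigs.sum()
    top = sigma_eigs.max()
    deviation = np.max(np.abs(sq_norms / p - trace / p))
    u = u_n(n, eps)
    bound = (top / p) * (4.0 + u ** 2) + 2.0 * u * np.sqrt(top / p) * np.sqrt(trace / p)
    return deviation, bound

eigs = np.linspace(0.5, 1.5, 2000)          # illustrative spectrum, p = 2000
deviation, bound = concentration_check(500, eigs)
print(f"observed deviation {deviation:.3f}, bound {bound:.3f}")
```

The bound is not tight, but it shrinks as $p$ grows with a well-behaved spectrum, which is what makes the individual $\lambda_i^2$ estimable in high dimension.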

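Proof. Let us consider the function $F(Y) = \|\Sigma^{1/2}Y\|_2/\sqrt{p}$. Clearly this
function is $\|\Sigma^{1/2}\|_2/\sqrt{p}$-Lipschitz with respect to Euclidean norm in
$\mathbb{R}^p$.

Now suppose that $Y_0 \sim \mathcal{N}(0, \mathrm{Id}_p)$. Let us call $m_F$ a median of $F(Y_0)$.
Using standard results on the concentration properties of Gaussian random
variables [see Ledoux (2001), Chapter 1 and Theorem 2.6], we have

\[ P\left(\left|\frac{\|\Sigma^{1/2}Y_0\|_2}{\sqrt{p}} - m_F\right| > t\right) \leq 2\exp\left(-\frac{pt^2}{2|||\Sigma|||_2}\right). \]

Hence, using a simple union bound argument, we have, after some algebra,
if $t_n = \sqrt{|||\Sigma|||_2/p}\,(2\log(n) + \log(n)^{\varepsilon})^{1/2}$,

\[ P\left(\max_{1 \leq i \leq n}\left|\frac{\|\Sigma^{1/2}Y_i\|_2}{\sqrt{p}} - m_F\right| > t_n\right) \leq 2\exp(-(\log(n))^{\varepsilon}/2). \]

So with large probability,

\[ \max_{1 \leq i \leq n}\left|\frac{\|\Sigma^{1/2}Y_i\|_2}{\sqrt{p}} - m_F\right| \leq t_n. \]

Now, if we call $\mu_F = \mathbf{E}(F(Y_0))$, we have, using Proposition 1.9 in Ledoux
(2001),

\[ |m_F - \mu_F| \leq \sqrt{2\pi}\sqrt{\frac{|||\Sigma|||_2}{p}} \quad \text{and} \quad
 0 \leq \frac{\mathrm{trace}(\Sigma)}{p} - \mu_F^2 \leq 4\,\frac{|||\Sigma|||_2}{p}. \]

Now, using the fact that
$\max_{1 \leq i \leq n}|a_i^2 - b^2| \leq \max_{1 \leq i \leq n}|a_i - b|\,(2b + \max_{1 \leq i \leq n}|a_i - b|)$
and the fact that $\mu_F \leq \sqrt{\mathrm{trace}(\Sigma)/p}$, we have

\[ \max_{1 \leq i \leq n}\left|\frac{\|\Sigma^{1/2}Y_i\|_2^2}{p} - \frac{\mathrm{trace}(\Sigma)}{p}\right|
 \leq \left|\frac{\mathrm{trace}(\Sigma)}{p} - \mu_F^2\right|
 + \max_{1 \leq i \leq n}\left|\frac{\|\Sigma^{1/2}Y_i\|_2}{\sqrt{p}} - \mu_F\right|
 \left(2\sqrt{\frac{\mathrm{trace}(\Sigma)}{p}} + \max_{1 \leq i \leq n}\left|\frac{\|\Sigma^{1/2}Y_i\|_2}{\sqrt{p}} - \mu_F\right|\right). \]

Our previous results imply that with large probability,

\[ \max_{1 \leq i \leq n}\left|\frac{\|\Sigma^{1/2}Y_i\|_2}{\sqrt{p}} - \mu_F\right| \leq \sqrt{\frac{|||\Sigma|||_2}{p}}\,u_n, \]

and therefore, with large probability,

\[ \max_{1 \leq i \leq n}\left|\frac{\|\Sigma^{1/2}Y_i\|_2^2}{p} - \frac{\mathrm{trace}(\Sigma)}{p}\right|
 \leq 4\,\frac{|||\Sigma|||_2}{p} + \sqrt{\frac{|||\Sigma|||_2}{p}}\,u_n\left(2\sqrt{\frac{\mathrm{trace}(\Sigma)}{p}} + \sqrt{\frac{|||\Sigma|||_2}{p}}\,u_n\right)
 = \frac{|||\Sigma|||_2}{p}(4 + u_n^2) + 2u_n\sqrt{\frac{|||\Sigma|||_2}{p}}\sqrt{\frac{\mathrm{trace}(\Sigma)}{p}}, \]

as announced in the proposition. As a consequence, we have, if we call
$v_n = \sqrt{|||\Sigma|||_2/\mathrm{trace}(\Sigma)}\,u_n$, with large probability,

\[ \max_{1 \leq i \leq n}\left|\frac{\|\Sigma^{1/2}Y_i\|_2^2}{\mathrm{trace}(\Sigma)} - 1\right| \leq 2v_n + v_n^2 + 4\,\frac{|||\Sigma|||_2}{\mathrm{trace}(\Sigma)}. \]

Hence, when for some $\varepsilon > 0$, $v_n$ goes to zero, which implies that

\[ \frac{|||\Sigma|||_2}{\mathrm{trace}(\Sigma)} \to 0, \]

we have with high probability,

\[ \max_{1 \leq i \leq n}\left|\frac{\|\Sigma^{1/2}Y_i\|_2^2}{\mathrm{trace}(\Sigma)} - 1\right| \to 0. \]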



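• Consistency of $\hat s_n$.

We now assume that $v_n$ goes to zero for some $\varepsilon > 0$ and turn to showing
the consistency of $\hat s_n$. First, let us note that

\[ \hat\mu - \mu \mid \{\lambda_i\}_{i=1}^n \sim \frac{\sqrt{\sum_{i=1}^n \lambda_i^2}}{n}\,\mathcal{N}(0, \Sigma). \]

Hence, by the same concentration arguments we just used, and using the
fact that the $\lambda_i$'s are i.i.d. with $\mathbf{E}(\lambda_i^2) = 1$, we have

\[ \|\hat\mu - \mu\|_2 \simeq \sqrt{p/n}\,\sqrt{\mathrm{trace}(\Sigma)/p}. \]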



Now, 



ly ,-.||2 

\ A i Pll2 \2 



trace(E) 



< 



A 



trace (E) 



trace (E) 



* II 2 



trace (E) 



1 



+ 



trace (E) 



Also, the law of large numbers (for triangular arrays) imply that with prob- 
ability 1 

trace(E) 



So we can write 



2 , trace(E) 



trace(E) 



trace (E) 



\Xi - /ill 2 . 



trace (E) 



A; 



+ a; 



trace(E) 



trace (E) 



1 



and we have now all the terms on the right-hand side under control.
In particular, it is clear that when $v_n \to 0$,

With all these preliminaries behind us, let us now turn to the final part
of the proof. Let us call

\[ g_n(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{1 + x\lambda_i^2\rho_n}. \]

With a slight abuse of notation, we note that $s$ is the solution of
$g_\infty(x) = 1 - \rho$. If $x_n$ is the solution of $g_n(x_n) = 1 - \rho_n$, it is clear
that $x_n$ is consistent for $s$: we can just use the fact that $g_n$ is decreasing
and evaluate it at $y_1$ and $y_2$ which are on either side of $s$. Clearly
$g_n(y_1)$ is consistent for $g_\infty(y_1)$, and similarly for $y_2$, so with high
probability, $x_n$ needs to be in $[y_1, y_2]$ asymptotically.

Recall that the roots we are looking for are positive. So we have

\[ \hat g_n(x) - g_n(x) = \frac{1}{n}\sum_{i=1}^n \frac{x\rho_n(\lambda_i^2 - \hat\lambda_i^2)}{(1 + x\hat\lambda_i^2\rho_n)(1 + x\lambda_i^2\rho_n)}, \]

and therefore, for $x \geq 0$,

\[ |\hat g_n(x) - g_n(x)| \leq x\rho_n\,\frac{1}{n}\sum_{i=1}^n |\hat\lambda_i^2 - \lambda_i^2|. \]

By noting that $\hat g_n(\hat s_n) = 1 - \rho_n = g_n(x_n)$, we have

\[ |\hat g_n(x_n) - \hat g_n(\hat s_n)| = |\hat g_n(x_n) - g_n(x_n)| \leq x_n\rho_n\,\frac{1}{n}\sum_{i=1}^n |\hat\lambda_i^2 - \lambda_i^2| \to 0, \]

since $x_n$ is bounded above.

Now since $\hat g_n(x)$ is decreasing and is pointwise consistent for $g_\infty(x)$, it
is clear that we can find $y_3$, deterministic and bounded away from $\infty$, such
that asymptotically, $\hat s_n \leq y_3$, with high probability. Also, $\hat g_n(x)$ is
convex, so this guarantees that $|\hat g_n'|$ can be bounded below (uniformly in $n$
with high probability) on $[0, \max(y_3, y_2)]$ by a quantity that is strictly
greater than 0 with high probability. Note that this latter interval contains
both $x_n$ and $\hat s_n$ asymptotically. Using the mean value theorem, the fact that
we have a lower bound (different from 0) on $|\hat g_n'(x)|$ on
$[0, \max(y_3, y_2)]$, and the equation in the previous display, we can finally
conclude that

\[ x_n - \hat s_n \to 0 \]

with high probability, and since $x_n$ is consistent for $s$, so is $\hat s_n$. □

5.4.4. On robust estimates of scatter. We just saw that we could take
advantage of the high-dimensionality of the problem to essentially estimate
$\lambda_i^2$, by using concentration of measure arguments. This also allows us to
propose estimates of scatter that are tailored for high-dimensional problems.

In low-dimension, estimation of individual $\lambda_i^2$ is not possible and a classic
proposal for estimating the scatter matrix $\Sigma$ is Tyler's estimator [see Tyler
(1987)], which is the solution $V_n$ (defined up to scaling), of the equation



It has been observed in a random matrix context [see Frahm and Jaekel
(2005) and Biroli, Bouchaud and Potters (2007)] that when using Tyler's 
estimator in connection with elliptically distributed data, one seemed to re- 
cover a spectrum that looked similar to predictions of the Marcenko-Pastur 
law, at least in the case of Id p scatter. At this point, the evidence is mostly 
based on simulations though a rigorous proof seems feasible with a little 
bit of effort (the argument given in [Biroli, Bouchaud and Potters (2007)] 
is interesting though it falls short of a "full proof," which is acknowledged 
in that paper). We do not try to give a proof here because this is quite far 
from being the topic of this paper. 

As a high-dimensional alternative to Tyler's estimator, we could use 



One potential advantage of this proposal over Tyler's estimator is that
Tyler's estimator is a priori not defined when p > n, because it becomes
impossible to invert $V_n$. Also, this estimator is rather quick to compute and
does not require multiple inversions of p x p matrices, where p is large [Tyler's
estimator is generally found through an iterative procedure; see Frahm and
Jaekel (2005) and references therein]. The spectral properties of our proposed
estimator are also quite easy to analyze in light of the detailed work we carried
out concerning consistency of our estimator of $s$. For instance, in the simple
case where $\mu$ is known, it is easy to see that under some conditions on $\Sigma$
and the $\lambda_i$'s, the limiting spectral distribution of the estimator will satisfy a
Marcenko-Pastur-type equation. (Because this is really tangential to our main
points in the paper, we do not give further details.)

Note that these estimates of scatter essentially make the influence of the
$\lambda_i$'s on the problem disappear, at least as far as covariance (or really
scatter) is concerned. So to answer a question asked by an insightful referee,
it is reasonable to think that another approach might be to turn the problem
back to an essentially Gaussian problem by using an estimate of scatter
instead of an estimate of covariance, if we ignore problems due to mean
estimation. Since in the Gaussian case, $s = 1/(1 - \rho)$, corrections are relatively
easy then. However, the impact of mean estimation needs to be investigated
and furthermore, at this point there are no rigorous results that we know of
(only very limited simulations) concerning the spectral properties of Tyler's
estimator in high-dimension. So we leave further investigations of the
properties of these estimates of scatter to future work, as they are not a primary
concern in this already long paper (after all we have a provably consistent
estimator that takes care of all the problems and is fast to compute).

Let us however note that using estimates of scatter (instead of covariance)
would likely yield a serious improvement in terms of the realized risk of
portfolios, which is discussed in the paper [El Karoui (2009b)]. However, these
questions touch more on the issue of allocation, whereas we are concerned
in this paper with estimating the efficient frontier and have shown that we
can do this well (at least asymptotically and theoretically) independently of
allocation issues, a fact that is potentially useful, for instance, for creating
benchmarks.

5.5. Numerical results and practical considerations. This subsection gives 
some numerical results to assess the quality of the proposed estimators for 
both weights and "efficient frontier." The simulation analysis is done in an a 
priori quite favorable case — the question being whether even then the theory 
could be useful in practice. 

Our aim was to investigate among other things the improvement in the
quality of our approximations as n and p grew to infinity. Hence, we present
the results of two simulation setups: one where n = 250, p = 100 and one
where n = 2500, p = 1000. We chose to work with simulations where we
picked both $\Sigma$ and $\mu$ so that we could guarantee, for instance, that the
efficient frontier was basically the same for both simulations.
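
As a quick orientation for the two setups just described, the following sketch
computes the ratio $\rho = p/n$, the Gaussian-case factor $1 - p/n$ by which the
variance is roughly underestimated, and the corresponding $s = 1/(1 - \rho)$. It is
an illustration only and not the code used for the simulations; the function
name `gaussian_factors` is hypothetical.

```python
def gaussian_factors(n, p):
    """Return (rho, 1 - rho, 1 / (1 - rho)) for n observations and p variables.

    Hypothetical helper: 1 - rho is the approximate Gaussian-case factor by
    which the naive variance estimate is too small, and s = 1 / (1 - rho).
    """
    rho = p / n
    if not 0.0 < rho < 1.0:
        raise ValueError("this sketch assumes 0 < p/n < 1")
    return rho, 1.0 - rho, 1.0 / (1.0 - rho)


# The two simulation setups described in the text.
for n, p in [(250, 100), (2500, 1000)]:
    rho, factor, s = gaussian_factors(n, p)
    print(f"n={n}, p={p}: rho={rho:.2f}, 1-rho={factor:.2f}, s={s:.3f}")
```

Both setups share the same ratio p/n, so under Gaussian assumptions the same
first-order correction would apply to both; only the fluctuations around it
are expected to differ.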

More specifically, we chose $\Sigma$ to be a p x p Toeplitz matrix, with
$\Sigma(i, j) = a^{|i-j|}$, where a = 0.4. In the smaller dimensional simulation, that
is, p = 100, we picked $v_1$ to be the eigenvector associated with the 90th smallest
eigenvalue of $\Sigma$. Calling $f_2$ the eigenvector associated with the 15th smallest
eigenvalue of $\Sigma$, we picked $v_2 = \mu$ to be $\sqrt{0.3}\,v_1 + \sqrt{0.7}\,f_2$. In the larger
dimensional simulation, we used for $v_1$ the eigenvector associated with the
900th smallest eigenvalue of $\Sigma$, while $f_2$ was now associated with the 150th
smallest eigenvalue of $\Sigma$. $\mu = v_2$ was computed in the same fashion in both
simulations.
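
A minimal sketch of the covariance matrix used in these simulations, assuming
its entries are $a^{|i-j|}$ with a = 0.4. The helper name `toeplitz_cov` is ours,
and the eigenvectors entering $\mu$ would have to come from a numerical
eigensolver, which is not shown here.

```python
def toeplitz_cov(p, a=0.4):
    """Return the p x p Toeplitz matrix with entries a**|i - j| as a list of rows."""
    return [[a ** abs(i - j) for j in range(p)] for i in range(p)]


sigma = toeplitz_cov(5)

# Entries depend only on |i - j|: constant along each diagonal, and symmetric.
constant_diagonals = all(
    sigma[i][j] == sigma[i + 1][j + 1] for i in range(4) for j in range(4)
)
symmetric = all(sigma[i][j] == sigma[j][i] for i in range(5) for j in range(5))
```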

The simulations are here to illustrate "how large is large," that is, when
the asymptotics kick in and our theoretical predictions become accurate.
The parameters were chosen so that we would be close to satisfying
assumptions A1-A5. Also, the choice of $v_1$ and $v_2$ guarantees that the off-diagonal
elements of $M$ are not zero; we thought that zero off-diagonal elements might
make the problem easier and lead to overoptimistic pictures. (This choice of
parameters is not motivated by a particular problem in Finance. We also note
that if we knew that the covariance matrix were Toeplitz, we could resort to
regularization methods to better solve the problem. However, if we applied the
same random rotation to $\Sigma$, $v_1$ and $v_2$, it becomes less clear how one could
use other approaches than the ones presented here for estimation.)

We did simulations both in the Gaussian case and in the case of an elliptical
distribution as described above, that is, $X_i = \mu + \lambda_i \Sigma^{1/2} Z_i$, where $\lambda_i$ was
proportional to a t-distributed random variable with 6 degrees of freedom
and scaled to have variance 1. We picked 6 degrees of freedom to have
simulations with relatively heavy tails and capture visually the corresponding
effects. It was also naturally a way to investigate the practical robustness
of our estimators and compare with the Gaussian case. We call below the
set of simulations involving the t-distribution the "$t_6$" case because of its
similarity with multivariate t-distributions.
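
The scaling step can be sketched as follows, under the assumption that "scaled
to have variance 1" means multiplying a Student t variable with $\nu$ degrees of
freedom by $\sqrt{(\nu-2)/\nu}$, since its variance is $\nu/(\nu-2)$. The function name
`scaled_t_lambdas`, the seed and the sample size are illustrative choices, not
taken from the paper.

```python
import math
import random


def scaled_t_lambdas(n, dof=6, seed=0):
    """Draw n variables proportional to a t(dof) variable, scaled to variance 1."""
    rng = random.Random(seed)
    scale = math.sqrt((dof - 2) / dof)  # Var(t_dof) = dof / (dof - 2) for dof > 2
    draws = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        chi2 = rng.gammavariate(dof / 2.0, 2.0)  # chi-square with dof degrees of freedom
        draws.append(scale * z / math.sqrt(chi2 / dof))
    return draws


lambdas = scaled_t_lambdas(200_000)
# Empirical second moment; it should be close to 1 for a sample this large.
second_moment = sum(x * x for x in lambdas) / len(lambdas)
```

Only $\lambda_i^2$ enters the covariance of $X_i$, so the sign of each draw is immaterial
for the scatter.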

We repeated the simulations 1000 times in all the cases considered. We
chose $u_1 = 1$ and $u_2$ (the "target returns" in a financial context) ranging
from 0.1 to 5.

We note that our estimators require taking inverses of matrices, which
naturally raises the question of how well conditioned those matrices are.
This is particularly the case when we deal with $\hat M$ and $M$: if $M$ is poorly
conditioned, even though $\hat M$ is a good estimator of $sM$, it can turn out that
$\hat M^{-1}$ is a relatively poor estimator of $M^{-1}/s$. In our simulations, both $M$
and $\Sigma$ were well conditioned but in practice, one should be aware of potential
difficulties that may arise if, for instance, $\hat M$ indicates that $M$ may be ill
conditioned. When this is the case, it is actually quite easy to make the
estimators perform poorly (but of course this violates assumptions A1-A5).

5.5.1. Estimation of portfolio weights. As we have seen earlier, the "naive"
weights obtained by plugging-in the sample mean and the sample covariance
matrix in our quadratic program with linear equality constraints are biased,
in the sense that their projection in any given direction will generally be
biased.

Here we show the performance of our estimator as measured by its
projection on $v_k = \mu$. It is a natural direction to consider since, for instance,
in a financial context and under our modeling assumptions, it gives us the
expected returns of our portfolio (conditional on $X_1, \ldots, X_n$).

As our limited simulations indicate, our estimator appears to be prac- 
tically unbiased here (even in the "lower-dimensional" case), which means 
in a financial context that the corresponding investment strategy will yield 
the returns that the investor expected. (We note that from a mean-variance 
point of view, we do not claim that our estimator is optimal. Work is under 
way to find better performing portfolios — but it requires a new set of theoret- 
ical investigations whose results are postponed to another paper. In limited 
simulations, it appeared that our "debiased" portfolio performed similarly 
to the naive one from a mean-variance point of view, its main advantage 
being that it delivers the returns that the investor expects.) 

We present two pictures, Figure 2, page 63 and Figure 3, page 64, to give
the reader a sense of the impact of the size of n and p on the estimators we
proposed [the "larger-dimensional" case gives significantly better results,
with narrower confidence bands, though (empirical) near-unbiasedness
is present in both cases].

[Figure 2. Panel title: "Markowitz: correction of portfolio, view of conditional
expected returns, n=250, p=100, "t-distribution" with 6 dofs." x-axis: Target Returns.]

Fig. 2. Performance of naive and corrected portfolios, for scaled "$t_6$" (top picture) and
Gaussian returns. Here n = 250, p = 100 and the number of simulations is 1000. The dashed
lines represent 95% confidence bands. The x-axis represents the returns an investor expects.
The y-axis represents what she would actually get on average (i.e., $\mu'w$). The plots show
both the bias in the naive solution (blue solid lines) and the fact that our estimator is
nearly unbiased (red solid lines). They also illustrate the robustness of our corrections.
The black line is very close to the red line, showing a very good correction (on average) in
this setting where assumptions A1-A5 are satisfied.

[Figure 3. Panel titles: "Markowitz: correction of portfolio, view of conditional
expected returns, n=2500, p=1000, "t-distribution" with 6 dofs." and "Markowitz:
correction of portfolio, view of conditional expected returns, n=2500, p=1000,
Gaussian case." x-axis: Target Returns, from 0.5 to 5.]

Fig. 3. Performance of naive and corrected portfolios, for scaled "$t_6$" (left picture) and
Gaussian returns. Here n = 2500, p = 1000 and the number of simulations is 1000. The
dashed lines represent 95% confidence bands. The x-axis represents the returns an investor
expects. The y-axis represents what she would actually get on average (i.e., $\mu'w$). The plots
show both the bias in the naive solution (blue solid lines) and the fact that our estimator
is nearly unbiased (red solid lines). They also illustrate the robustness of our corrections.
Note the narrower confidence bands as compared to Figure 2. The black line is essentially
hidden under the red line, showing a near perfect correction (on average) in this setting
where assumptions A1-A5 are satisfied.

[Figure 4. Panel titles: (a) "Markowitz: correction of Frontier, n=250, p=100,
"t-distribution" with 6 dofs."; (b) "Markowitz: correction of Frontier, n=250, p=100,
Gaussian case"; (c) "Markowitz: correction of Frontier, n=2500, p=1000,
"t-distribution" with 6 dofs."; (d) "Markowitz: correction of Frontier, n=2500, p=1000,
Gaussian". Legend: Mean Naive Estimate, Truth, Mean Corrected Estimate.]

Fig. 4. Performance of naive and corrected frontiers, for scaled "$t_6$" [(a) and (c)] and
Gaussian returns [(b) and (d)]. Here, in the left column n = 250 and p = 100. In the
right column, n = 2500, p = 1000. The number of simulations is 1000 in all pictures. The
dashed lines represent (empirical) 95% confidence bands. (The confidence bands are
computed for a fixed y.) The x-axis represents our estimate of variance of the
optimal portfolio. The y-axis represents the target returns for the portfolio. The plots show
both the bias in the naive solution (blue solid curves) and the fact that our estimator is
nearly unbiased (red solid curves near, or covering the black curve, the population
solution). They also illustrate the robustness of our corrections. Another striking feature is
the lack of robustness of Gaussian computations, since the "efficient frontiers" computed
with "$t_6$" returns are different from the Gaussian ones. The fact that, as our theoretical
work predicts, Gaussian computations underestimate risk-underestimation in the class of
elliptical distributions considered in the paper is illustrated by the fact that the "$t_6$" curves
are to the left of the Gaussian curves. Note the narrower confidence bands in the larger
dimensional simulations [(c) and (d)]. The black line is essentially hidden under the red
line in (c) and (d), showing a near perfect correction (on average) in this setting where
assumptions A1-A5 are satisfied.

5.5.2. Correction to the frontier. We now turn to the issue of estimating
the "efficient frontier," that is, the curve that represents the minima of our
convex optimization problem (QP-eqc), on page 8. The pictures we present
on Figure 4 (see page 65) were obtained from the simulations we described
above. We chose to plot the variance (i.e., $\min w'\Sigma w$) on the x-axis and the
"target returns" [i.e., the $u_2$'s in the notation of equation (QP-eqc)] on the
y-axis as this is the convention in financial applications.

As the reader can see, our estimator turns out to be essentially unbiased,
even in the "lower-dimensional" case. We note too that the variance can
be quite large but that the confidence bands obtained from our corrections
were always to the right of the confidence bands obtained from the naive
estimator. This means that, if one is concerned with risk estimation, in
(essentially) the worst case for our estimator we still obtained a better
performing estimator than in (essentially) the best case for the naive estimator.
(We do not claim that this is always the case and it might be an artifact of
the simulation setup chosen here.)

Finally, for graphical purposes and to help comparisons, we chose to put 
all the graphs on the same scale. Some of the information on our original 
graphs (for the "lower-dimensional" case) was therefore left out but can be 
inferred by "naturally" extrapolating the curves shown on our graphs which 
are essentially parabolas. 

5.6. Remarks on inequality constraints. Our work has mostly been con- 
cerned with obtaining results for the case of a quadratic program with linear 
equality constraints. We now explain that our results can also be used to 
obtain approximation results concerning the case of a quadratic program 
with linear inequality constraints. 

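In this subsection we therefore consider the problem

\[
\text{(QP-ineqc-Pop)}\qquad
\begin{cases}
\inf w'\Sigma w,\\
V'w \in Q.
\end{cases}
\]

Here Q is a subset of $\mathbb{R}^k$ and V is a p x k matrix. We naturally want to
relate the solution of the above problem to that of the empirical version of
the problem:

\[
\text{(QP-ineqc-Emp)}\qquad
\begin{cases}
\inf w'\hat\Sigma w,\\
\hat V'w \in Q.
\end{cases}
\]
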
When Q is a product of intervals, we obtain a quadratic program with
linear inequality constraints. But our formulation allows us to deal with even
more complicated constraint structures. We note that if $G(U)$ is the solution
of problem (QP-ineqc-Pop) with Q = {U} (i.e., a singleton), where U is a
vector in $\mathbb{R}^k$, we are back in the case of the equality constrained problem
that we worked with for most of this paper. Let us call $\hat G(U)$ the solution of
problem (QP-ineqc-Emp) with Q = {U}. We now make the simple following
observation: note that

\[ f_{\mathrm{theo}}(Q) = \inf_{U \in Q} G(U), \]

\[ f_{\mathrm{emp}}(Q) = \inf_{U \in Q} \hat G(U). \]

The main idea here is that we can find a deterministic equivalent to
$f_{\mathrm{emp}}(Q)$ and we can relate this deterministic equivalent to $f_{\mathrm{theo}}(Q)$.

Recall from Section 2 that $G(U) = U'M^{-1}U$ and $\hat G(U) = U'\hat M^{-1}U$. Recall
also that under the assumptions A1-A5 made at the beginning of this section,
we have found a deterministic equivalent to $\hat M^{-1}$: we have shown that
$\hat M \simeq sM + \kappa_n e_k e_k' = M_0(s, \kappa_n)$ in probability. The previous result is valid
entry-wise, and since we assume that k stays bounded in the asymptotics
we are considering, it is also valid in operator norm. Now, $\hat M$ is invertible
with probability one under our assumptions, so we have, using the first
resolvent identity, that is, $A^{-1} - B^{-1} = A^{-1}(B - A)B^{-1}$,

\[ |||\hat M^{-1} - M_0^{-1}(s, \kappa_n)|||_2 \le |||\hat M^{-1}|||_2 \, |||M_0^{-1}(s, \kappa_n)|||_2 \, |||\hat M - M_0(s, \kappa_n)|||_2. \]

Hence, since $|||\hat M^{-1}|||_2$ remains bounded under our assumptions,

\[ |||\hat M^{-1} - M_0^{-1}(s, \kappa_n)|||_2 \to 0 \quad \text{in probability.} \]
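
The first resolvent identity quoted above can be checked on a small example.
The sketch below uses exact rational arithmetic on two arbitrary invertible
2 x 2 matrices; the matrices and the helper names `matmul`, `inv2` and `sub`
are illustrative choices and do not come from the paper.

```python
from fractions import Fraction as F


def matmul(x, y):
    """Product of two 2 x 2 matrices given as lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def inv2(x):
    """Inverse of an invertible 2 x 2 matrix via the adjugate formula."""
    det = x[0][0] * x[1][1] - x[0][1] * x[1][0]
    return [[x[1][1] / det, -x[0][1] / det],
            [-x[1][0] / det, x[0][0] / det]]


def sub(x, y):
    """Entry-wise difference of two 2 x 2 matrices."""
    return [[x[i][j] - y[i][j] for j in range(2)] for i in range(2)]


A = [[F(2), F(1)], [F(1), F(3)]]
B = [[F(4), F(0)], [F(1), F(2)]]

lhs = sub(inv2(A), inv2(B))                        # A^{-1} - B^{-1}
rhs = matmul(matmul(inv2(A), sub(B, A)), inv2(B))  # A^{-1} (B - A) B^{-1}
identity_holds = lhs == rhs
```

Taking operator norms on both sides of the identity gives the product bound
used in the text.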

Under our assumptions, we also know that the smallest eigenvalue of $\hat M$
and $M_0(s, \kappa_n)$ stay bounded away from 0. Therefore, for any $\delta > 0$, we know
that asymptotically, and with probability 1,

\[ \forall U \in \mathbb{R}^k, \qquad |U'\hat M^{-1}U - U'M_0^{-1}(s, \kappa_n)U| \le \delta \|U\|_2^2. \]

Furthermore, let us note that assumption A2 guarantees that $|||M|||_2$ remains
bounded and hence so do $|||\hat M|||_2$ and $|||M_0(s, \kappa_n)|||_2$.
We have the following theorem.

Theorem 5.5. Suppose $G_0$ and $G_n$ are maps from $\mathbb{R}^k$ to $\mathbb{R}_+$ such that:

1. $G_0$ is deterministic and $G_n$ is possibly random.

2. $G_0(0) = G_n(0) = 0$.

3. $\exists c_0 > 0$ such that, $\forall U$, $G_0(U) \ge c_0\|U\|_2^2$. Similarly, $\exists c_n > 0$ such that,
$\forall U$, $G_n(U) \ge c_n\|U\|_2^2$. Furthermore, $c_n \to c_0$ with probability 1.

4. $\exists \delta_n$ such that $\delta_n \to 0$ in probability and $\forall U$, $|G_n(U) - G_0(U)| \le \delta_n\|U\|_2^2$.

Assume that k is fixed as $n \to \infty$. Suppose Q is a (nonempty) subset of $\mathbb{R}^k$
and that we can find $U_0 \in Q$ such that $G_0(U_0) < \infty$ and $U_0 \ne 0$. Then,

\[ \lim_{n \to \infty} \inf_{U \in Q} G_n(U) = \inf_{U \in Q} G_0(U) \quad \text{in probability.} \]

We have the following corollary:

Corollary 5.6. When assumptions A1-A5 are satisfied

\[ f_{\mathrm{emp}}(Q) \to \inf_{U \in Q} U'M_0^{-1}(s, \kappa_n)U \quad \text{in probability.} \]

Hence, we have found a deterministic equivalent to $f_{\mathrm{emp}}(Q)$. It should
also be noted that because $U'M_0^{-1}(s, \kappa_n)U \le \frac{1}{s}U'M^{-1}U$, we also have

\[ f_{\mathrm{emp}}(Q) \le \frac{1}{s}\inf_{U \in Q} U'M^{-1}U = \frac{1}{s}f_{\mathrm{theo}}(Q) \quad \text{with high probability.} \]

Hence, our results on risk underestimation remain valid, even with these
more general (nonequality) linear constraints. The comparison theorems
between Gaussian and elliptical assumptions also remain valid, because of
similar comparison theorems for their deterministic equivalents. [Note also that
when $\hat M \simeq sM$ (i.e., when the sample mean does not appear in V), the
previous inequalities become equalities.] Finally, our corrections also give a way
to get a consistent estimator of $f_{\mathrm{theo}}(Q)$: one can simply solve the
optimization problem over Q with $\hat M$ replaced by $\frac{1}{s}(\hat M - \kappa_n e_k e_k')$ in the definition
of $\hat G$.

Note that the corollary follows immediately from Theorem 5.5 because of
our remarks on the operator norm of $\hat M$ and $\hat M^{-1}$ and their deterministic
equivalents.

Let us now prove Theorem 5.5. 

Proof of Theorem 5.5. Let us pick $U_0$ in Q. We can do so because Q
is nonempty. We assume without loss of generality that $0 \notin Q$, for otherwise
the problem is trivial, since 0 is the global minimizer of both (deterministic
and stochastic) problems.

Let us pick $r_0 = \sqrt{2G_0(U_0)/c_0}$, with $U_0 \ne 0$. Suppose that $U \notin B(0, r_0)$,
where $B(0, r_0)$ is the closed ball of radius $r_0$ with center 0. Then, our
assumptions on $G_0$ guarantee that

\[ G_0(U) \ge c_0\|U\|_2^2 > c_0\,\frac{2G_0(U_0)}{c_0} = 2G_0(U_0). \]

So $U_0 \in B(0, r_0)$. Also, if we call $Q(r_0) = Q \cap B(0, r_0)$, $Q(r_0)$ is nonempty
and

\[ \inf_{U \in Q} G_0(U) = \inf_{U \in Q(r_0)} G_0(U), \]

because if U is outside of $B(0, r_0)$, $G_0(U) > G_0(U_0)$. Now, suppose that
$\{\alpha_t\}_{t \in T}$ and $\{\beta_t\}_{t \in T}$ are two sets of real numbers. We have

\[ \Big|\inf_t \alpha_t - \inf_t \beta_t\Big| \le \sup_t |\alpha_t - \beta_t|. \]

As a matter of fact, for any j,

\[ \Big(\inf_k \alpha_k\Big) - \beta_j \le \alpha_j - \beta_j \le |\alpha_j - \beta_j| \le \sup_k |\alpha_k - \beta_k|. \]

Now $\sup_j[(\inf_k \alpha_k) - \beta_j] = (\inf_k \alpha_k) - (\inf_k \beta_k)$. And the previous display
guarantees that $\sup_j[(\inf_k \alpha_k) - \beta_j] \le \sup_k |\alpha_k - \beta_k|$. By symmetry of the role of
$\alpha$ and $\beta$, we therefore have

\[ \Big|\inf_j \alpha_j - \inf_j \beta_j\Big| \le \sup_i |\alpha_i - \beta_i|. \]

Hence, we can conclude that

\[ \Big|\inf_{U \in Q(r_0)} G_0(U) - \inf_{U \in Q(r_0)} G_n(U)\Big| \le \sup_{U \in Q(r_0)} |G_0(U) - G_n(U)| \le \delta_n r_0^2, \]

by our assumptions, and the fact that $\|U\|_2 \le r_0$ in $Q(r_0)$. Hence, since $r_0$
stays fixed as $n \to \infty$,

\[ \Big|\inf_{U \in Q(r_0)} G_0(U) - \inf_{U \in Q(r_0)} G_n(U)\Big| \to 0 \quad \text{in probability.} \]

If we can show that with high-probability,

\[ \inf_{U \in Q(r_0)} G_n(U) = \inf_{U \in Q} G_n(U), \]

the result will be shown. First we note that if $U \notin B(0, r_0)$,

\[ G_n(U) > c_n\,\frac{2G_0(U_0)}{c_0} \ge (1 + \delta)G_n(U_0) \]

for some $\delta > 0$ with high probability under our assumptions. Let us call $E_\delta$
the event $E_\delta = \{2c_n/c_0 \ge (1 + \delta)G_n(U_0)/G_0(U_0)\}$. Of course, $P(E_\delta) \to 1$
under our assumptions, since $2c_n/c_0 \to 2$ in probability and $G_n(U_0)/G_0(U_0) \to 1$
in probability. When $E_\delta$ is true, we have

\[ \inf_{U \in B^c(0, r_0) \cap Q} G_n(U) \ge (1 + \delta)G_n(U_0) > G_n(U_0) \ge \inf_{U \in Q(r_0)} G_n(U). \]

So when $E_\delta$ is true, and hence with high-probability,

\[ \inf_{U \in Q} G_n(U) = \inf_{U \in Q(r_0)} G_n(U). \]

We can finally conclude that

\[ \inf_{U \in Q} G_n(U) \to \inf_{U \in Q} G_0(U) \quad \text{in probability,} \]

and the theorem is proved. □
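
The elementary inequality between infima used in this proof, namely that the
gap between two infima is at most the supremum of the pointwise gaps, can be
illustrated numerically on finite index sets. This is a sanity check on
randomly drawn values, not a proof; the helper name `inf_gap_bound_holds`,
the seed and the ranges are arbitrary.

```python
import random


def inf_gap_bound_holds(alpha, beta, tol=1e-12):
    """Check |inf alpha - inf beta| <= sup |alpha_t - beta_t| on a finite index set."""
    gap_of_infs = abs(min(alpha) - min(beta))
    sup_gap = max(abs(a - b) for a, b in zip(alpha, beta))
    return gap_of_infs <= sup_gap + tol


rng = random.Random(0)
checks = []
for _ in range(1000):
    alpha = [rng.uniform(-5.0, 5.0) for _ in range(10)]
    beta = [rng.uniform(-5.0, 5.0) for _ in range(10)]
    checks.append(inf_gap_bound_holds(alpha, beta))

all_hold = all(checks)
```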

6. Conclusion. This study of quadratic programs with linear constraints
whose parameters are estimated from data has highlighted the difficulties
created by the high-dimensionality of the data. In particular, we have shown
that the fact that n (the number of observations used to estimate the
parameters) and p both grow to infinity leads to a systematic underestimation of the
minimal "risk" one is exposed to when approaching the optimization
problem (QP-eqc-Pop) by solving its naive proxy (QP-eqc-Emp).

Our study produced exact distributional results in the Gaussian case
(Section 3) and convergence results in probability in the elliptical case
(Section 4), which also allowed us to reach conclusions for the bootstrap and the
case of nonindependent data (in particular, it covers the case of Gaussian 
data correlated in time). As explained in Section 5, the study of the Gaus- 
sian case gives an over-optimistic assessment of risk underestimation in the 
context we study: in the class of elliptical distributions we consider, risk is 
minimally underestimated in the Gaussian case, and the situation is more 
dire for other elliptical distributions. Our study also highlights the fact that 
standard bootstrap estimates of bias will be inconsistent. It also suggests 
that in the case of correlated Gaussian observations, risk underestimation is 
likely to be more severe than in the i.i.d. case. 

Another benefit of our analysis is that it sheds light on what is creating 
those difficulties and allows us to propose robust corrections to these prob- 
lems. As shown in the theoretical part of the paper and illustrated in our 
limited simulation work, they are robust in the class of elliptical distributions 
we consider. They also appear to work reasonably well in practice (when the 
underlying assumptions hold), as our (somewhat limited) simulation work 
seems to indicate. 

Perhaps surprisingly, we did not need to make very strong assumptions 
about the covariance matrix at stake or the mean, whereas recent statistical 
work focused on estimation of covariance matrices [see El Karoui (2008) or 
Bickel and Levina (2008b)] tends to do so. This is in part because our theo- 
retical analysis clearly showed what functionals of these two parameters one 
needed to estimate, and hence we were able to bypass stronger requirements 
by focusing on those particular functionals and correcting the first order 
errors that appeared. In other words, even though our aim was to estimate 
a complicated function of the population covariance matrix and of the pop- 
ulation mean, for which we do not have good estimators in high-dimension 
in general, we were able to use poor estimators of both (and our theoretical 
analysis) to get an accurate estimator of the functional of interest. This is 
an interesting result in the context of high-dimensional statistics more gen- 
erally, as it suggests that we might be able to estimate certain functions 
of high-dimensional parameters without having to accurately estimate the 
parameters themselves [and hence we might be able to bypass in some situa- 
tions sparsity (or other similar requirements) for the population quantities]. 

Besides the interesting statistical and mathematical questions this study
raised, we hope that it might also be helpful to, for instance, financial
regulators by perhaps providing them with more realistic benchmarks for the
performance of optimal portfolios and that it sheds light on how the
high-dimensionality of the data affects the proper assessment of risk of large
portfolios obtained by solving high-dimensional optimization problems.

APPENDIX A: CLASSICAL RESULTS OF LINEAR ALGEBRA

A.1. On inverses of partitioned matrices. In our study of the Gaussian
case, and in particular in connection with properties of Wishart matrices,
we relied several times on properties of the inverse of a partitioned matrix.
Here is a detailed statement of what we needed.

Let A be a generic matrix, and let us decompose it by blocks

\[ A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}. \]

Let us call $A^{-1}$ the inverse of A. We assume that all inverses we take are
well defined. Let us write

\[ A^{-1} = \begin{pmatrix} A^{11} & A^{12} \\ A^{21} & A^{22} \end{pmatrix}. \]

Then, it is well known that [see, e.g., Mardia, Kent and Bibby (1979), pages
458-459, or Boyd and Vandenberghe (2004), page 650]

(A.1) $A^{11} = (A_{11} - A_{12}A_{22}^{-1}A_{21})^{-1}$,

(A.2) $A^{22} = (A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1}$,

(A.3) $A^{12} = -A_{11}^{-1}A_{12}A^{22}$,

(A.4) $A^{21} = -A_{22}^{-1}A_{21}A^{11}$.
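
As a small sanity check of the four formulas above, the sketch below
specializes them to the simplest case where every block is 1 x 1, so that A is
an ordinary 2 x 2 matrix whose inverse is known in closed form. The numerical
entries are arbitrary and the variable names are ours; exact rational
arithmetic avoids rounding issues.

```python
from fractions import Fraction as F

# Scalar "blocks" of a 2 x 2 matrix A (arbitrary values with nonzero determinant).
a11, a12, a21, a22 = F(4), F(1), F(2), F(3)

# Direct inverse via the adjugate formula.
det = a11 * a22 - a12 * a21
direct_inverse = [[a22 / det, -a12 / det],
                  [-a21 / det, a11 / det]]

# Blocks of the inverse according to (A.1)-(A.4).
top_left = 1 / (a11 - a12 * a21 / a22)        # (A.1)
bottom_right = 1 / (a22 - a21 * a12 / a11)    # (A.2)
top_right = -(1 / a11) * a12 * bottom_right   # (A.3)
bottom_left = -(1 / a22) * a21 * top_left     # (A.4)
block_inverse = [[top_left, top_right],
                 [bottom_left, bottom_right]]

formulas_agree = block_inverse == direct_inverse
```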



APPENDIX B: RANDOM MATRIX RESULTS 

B.1. Lower bounds on smallest eigenvalue. In many proofs in the course
of the paper we needed to have quantitative bounds on the behavior of the
smallest eigenvalue of a number of matrices and made repeated use of the
following lemma.

Lemma B.1. Suppose Y is a n x p matrix, with i.i.d. $\mathcal{N}(0, 1)$ entries,
with $p/n \to \rho$, and $0 < \rho < 1$.

Suppose $\Lambda$ is an n x n diagonal and deterministic matrix and that we
can find N, $C > 0$ and $\epsilon > 0$ such that, if $\tau_i$ is the ith largest eigenvalue
of $\Lambda'\Lambda$, $\tau_N > C$, for some fixed $C > 0$. N is such that, for p and n large,
$p/N < 1 - \epsilon$ and $N/n$ stays bounded away from 0. Finally, we assume that
all the diagonal entries of $\Lambda$ are different from 0.

Call $H = \mathrm{Id} - \delta\delta'/n$, where $\|\delta\|_2^2 = n$. Then $\lambda_p$, the smallest eigenvalue of
$Y'\Lambda'H\Lambda Y/(n - 1)$, is bounded away from 0 with high-probability.

In particular, when $p/N < 1 - \epsilon$, if $\mathfrak{C}_n = C\frac{N-1}{n-1}$,

\[ P\Big(\sqrt{\lambda_p} \le \sqrt{\mathfrak{C}_n}\,\big[(1 - \sqrt{1 - \epsilon}) - t\big]\Big) \le \exp(-(N - 1)t^2). \]

The following proof makes clear that the result holds also when some of
the diagonal entries of $\Lambda$ are equal to zero if we make the following
modification: n should now denote the number of nonzero entries on the diagonal
of $\Lambda$, and the corresponding assumptions about p and N should then hold.
We also point out that under our assumptions H is an orthogonal projection
matrix.

Proof of Lemma B.1. Before we start the proof per se, we need some
notations: we call $\lambda_k$ the kth largest eigenvalue of a symmetric matrix. In
other words, the eigenvalues are decreasingly ordered and $\lambda_1 \ge \lambda_2 \ge \cdots$.

The result is known if $\Lambda = \mathrm{Id}_n$, since, in law,

\[ \frac{1}{n-1}Y'HY = \frac{1}{n-1}\mathcal{W}_p(\mathrm{Id}_p, n - 1). \]

Using Davidson and Szarek (2001), Theorem II.13, we have the following
result: the smallest eigenvalue of a matrix with distribution $\mathcal{W}_p(\mathrm{Id}_p, n_0)/n_0$
is strongly concentrated around $(1 - \sqrt{p/n_0})^2$ when $p < n_0$, and

\[ P\Big(\sqrt{\lambda_p} \le (1 - \sqrt{p/n_0}) - t\Big) \le \exp(-n_0 t^2). \]

This gives our result in the case where $\Lambda = \mathrm{Id}_n$. Let us now investigate
what happens when $\Lambda$ is not $\mathrm{Id}_n$.

The matrix $M = \Lambda'H\Lambda$ is a rank-1 perturbation of $\Lambda'\Lambda$ and is positive
semi-definite, because H is. Therefore, for any $k \ge 2$, $\lambda_{k-1}(\Lambda'H\Lambda) =
\lambda_{k-1}(M) \ge \lambda_k(\Lambda'\Lambda)$, by the interlacing Theorem 4.3.4 in Horn and
Johnson (1990). M has rank n - 1, since $M\Lambda^{-1}\delta = 0$ and $\mathrm{rank}(M) \ge
\mathrm{rank}(\Lambda'\Lambda) - 1 = n - 1$.

We can diagonalize $M = ODO'$, where D has (n - 1) nonzero coefficients,
and because $O'Y = Y$ in law, we have, in law,

\[ Y'MY = Y'\Lambda'H\Lambda Y = Y'DY = \sum_{i=1}^{n-1} d_i Y_i Y_i', \]

where $d_i$ are the nonzero diagonal entries of D. Because M is positive
semi-definite, we have $d_i > 0$ for all i. In other respects, for all $k \le n - 1$,
$d_k \ge \lambda_{k+1}(\Lambda'\Lambda) = \tau_{k+1}$ by our remark on interlacing inequalities. Hence, we
have, if $\succeq$ denotes positive-semidefinite ordering,

\[ \sum_{i=1}^{n-1} d_i Y_i Y_i' \succeq \sum_{i=1}^{N-1} d_i Y_i Y_i' \succeq \tau_N \sum_{i=1}^{N-1} Y_i Y_i' = \tau_N \mathcal{W}_p(\mathrm{Id}_p, N - 1). \]

Therefore, we have in law,

\[ \frac{1}{n-1}Y'\Lambda'H\Lambda Y \succeq C\,\frac{N-1}{n-1}\,\frac{1}{N-1}\mathcal{W}_p(\mathrm{Id}_p, N - 1). \]

As we recalled above, the smallest eigenvalue of $\mathcal{W}_p(\mathrm{Id}_p, N - 1)/(N - 1)$
remains bounded away from 0 with high-probability in our setting because
$p/N$ remains bounded away from 1 by assumption. We also assumed that
$N/n$ and C were bounded away from 0. If we call $\mathfrak{C} = \liminf_{n \to \infty} \mathfrak{C}_n$, we
have $\mathfrak{C} > 0$, and, for any $\eta > 0$, according to the result of Davidson and Szarek
(2001) we have, for $\lambda_p$ the smallest eigenvalue of $Y'\Lambda'H\Lambda Y/(n - 1)$ and
$\mathfrak{C}_n = C(N - 1)/(n - 1)$,

\[ P\Big(\sqrt{\lambda_p} \le \sqrt{\mathfrak{C}_n}\,\big[(1 - \sqrt{p/(N-1)}) - t\big]\Big) \le \exp(-(N - 1)t^2). \]

In particular, when $p/N$ is such that $p/N < 1 - \epsilon$,

\[ P\Big(\sqrt{\lambda_p} \le \sqrt{\mathfrak{C}_n}\,\big[(1 - \sqrt{1 - \epsilon}) - t\big]\Big) \le \exp(-(N - 1)t^2). \]

Interestingly, this bound is "quite uniform" in $\Lambda$, in the sense that the
only characteristics of $\Lambda$ that matter are $\mathfrak{C}_n = C\frac{N-1}{n-1}$ and N. □

APPENDIX C: GENERALIZATIONS OF THE PROOF OF THEOREM 4.3

This part of the Appendix explains how to appropriately modify the 
proofs of Theorems 4.1 and 4.6 to obtain the results we need in the case of 
correlated observations (Section 4.3) and the bootstrap. 

C.1. On $v'\hat\Sigma^{-1}v$ when the observations are correlated. We explain in
this subsection how to modify the proof of Theorem 4.1 in the case where
the vectors of observations $X_i$ and $X_j$ are potentially correlated. The data
was assumed to have the following representation, in matrix form:

\[ X = e\mu' + \Lambda Y\Sigma^{1/2}, \]

where $\Lambda$ is n x n, deterministic but not necessarily diagonal and Y has i.i.d.
$\mathcal{N}(0, 1)$ entries. We also wrote the SVD of $\Lambda$ as $\Lambda = ADB'$, where A and B
are orthogonal.

If we call $H = \mathrm{Id}_n - ee'/n$, we have, of course,

\[ \hat\Sigma = \frac{1}{n-1}X'HX = \frac{1}{n-1}\Sigma^{1/2}Y'\Lambda'H\Lambda Y\Sigma^{1/2}. \]

The orthogonality of B implies $B'Y = Y$ in law, and we have, in law,

\[ \hat\Sigma = \frac{1}{n-1}\Sigma^{1/2}Y'D(A'HA)DY\Sigma^{1/2}. \]

If we now call $\delta = A'e$, we see that $\|\delta\|_2^2 = n$, because A is orthogonal. It can
also easily be seen that $A'HA = \mathrm{Id}_n - \delta\delta'/n = H_\delta$. Because of the remark we
just made on the norm of $\delta$, $H_\delta$ is clearly an orthogonal projection matrix.
So we have to understand

\[ \hat\Sigma = \frac{1}{n-1}\Sigma^{1/2}Y'DH_\delta DY\Sigma^{1/2}, \]

which is extremely close to the situation of Theorem 4.1, where we had to
work with

\[ \hat\Sigma = \frac{1}{n-1}\Sigma^{1/2}Y'DH_e DY\Sigma^{1/2}. \]

D now plays the role $\Lambda$ played in Theorem 4.1 and the main modification is
that $H = H_e$ is now replaced by $H_\delta$.

An examination of the proof of Theorem 4.1 shows that we never relied
on the fact that we used specifically $H_e$ (instead of $H_\delta$) in that proof. All
we used was the fact that our H there was a rank-1 perturbation of $\mathrm{Id}_n$ and
an orthogonal projection matrix. Similarly, Lemma B.1, on which we relied
in the course of the proof of Theorem 4.1, handles $H_\delta$ for general $\delta$ with
squared norm n without any problems, so it is still usable in the course of
the current study.

Because we know that the squared singular values of $\Lambda$ (and hence the
eigenvalues of D) satisfy (Assumption-BB), the proof of Theorem 4.1 goes
through without further modifications and Proposition 4.7 holds.
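
The algebraic fact used in this subsection, that conjugating the centering
matrix by an orthogonal matrix A turns it into the same kind of projection
built on the rotated vector A'e, can be checked on a toy example. The sketch
below assumes n = 2 and takes A to be a plane rotation; the angle and all
variable names are arbitrary illustrative choices.

```python
import math

n = 2
theta = 0.7  # arbitrary rotation angle
A = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta), math.cos(theta)]]  # orthogonal matrix
e = [1.0, 1.0]                            # vector of ones


def projection(v):
    """Return Id_n - v v' / n as a list of rows."""
    return [[(1.0 if i == j else 0.0) - v[i] * v[j] / n for j in range(n)]
            for i in range(n)]


H = projection(e)
delta = [sum(A[k][i] * e[k] for k in range(n)) for i in range(n)]  # delta = A'e

# Left-hand side A' H A, computed entry by entry.
AtHA = [[sum(A[k][i] * H[k][l] * A[l][j] for k in range(n) for l in range(n))
         for j in range(n)] for i in range(n)]
H_delta = projection(delta)

max_abs_diff = max(abs(AtHA[i][j] - H_delta[i][j])
                   for i in range(n) for j in range(n))
squared_norm_delta = sum(d * d for d in delta)
```

Here `max_abs_diff` should be zero up to rounding and `squared_norm_delta`
should equal n, which is what makes the rotated matrix an orthogonal
projection again.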

C.2. On quadratic forms involving random projection matrices. A recurrent
issue in the questions we addressed was the understanding of statistics
of the form

\[ \frac{1}{n}u'Pu, \]

where P is a random projection matrix and u a (generally deterministic)
vector of dimension n. In particular, the projection matrices we dealt with
were of the form

\[ P = \Lambda Y(Y'\Lambda^2 Y)^{-1}Y'\Lambda, \]

for $\Lambda$ a (possibly random) n x n diagonal matrix and Y an n x p matrix with
i.i.d. $\mathcal{N}(0, 1)$ entries. We also assume that $\Lambda$ and u are independent of Y.
Finally, we assume that $\|u\|_2/\sqrt{n} = 1$.

In the course of the text, we successfully carried out computations when
u = e, but relied to do so on properties of trace(P). The case of general u is
more involved and is treated here.

Lemma C.1. Assume that $\Lambda$ and $u$ (which is deterministic) are such
that

\[
\frac{1}{n^2}\sum_{i=1}^n u_i^4 \to 0 \quad\text{and}\quad \frac{1}{n^2}\sum_{i=1}^n \lambda_i^4 \to 0,
\]

and that (Assumption-BB) holds for $\Lambda$ for a certain sequence $N(n)$.
Under the preceding assumptions, we have, if $Z(u) = \frac{1}{n}u'Pu$,

\[
Z(u) - \frac{1}{n}\sum_{i=1}^n u_i^2\, E(P(i,i)\mid\Lambda) \to 0 \quad\text{in probability}
\]

conditionally on $\Lambda$.

Proof. We simply sketch the modifications to the proof given after the
statement of Theorem 4.3. As noted in Lemma 4.4, the off-diagonal elements
of $P$ have mean 0 conditionally on $\Lambda$. Now, using the same notations as in
Theorem 4.3, we have, using equation (7) there, if $Z_i(u)$ is the quantity
obtained by replacing $\lambda_i$ by 0 in $Z$, $r_i = W_i\mathcal{S}_i^{-1}Y_i$, $w_i = r_i'u/n$ and $u_i$ is the
$i$th coordinate of $u$,

\[
Z(u) = Z_i(u) + \frac{1}{n}\,\frac{1}{1+\lambda_i^2 q_i}\left(-\lambda_i^2 w_i^2 + 2\lambda_i u_i w_i + \lambda_i^2 u_i^2 q_i\right).
\]

The expression between the parentheses is easily seen to be equal to
$(1+\lambda_i^2 q_i)u_i^2 - (\lambda_i w_i - u_i)^2$. We get an analog of equation (8)

\[
Z(u) = Z_i(u) + \frac{1}{n}\left(u_i^2 - \frac{(\lambda_i w_i - u_i)^2}{1+\lambda_i^2 q_i}\right).
\]
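
For completeness, the algebraic identity behind this step can be expanded in two lines. This is only a check, written with the symbols $\lambda_i$, $u_i$, $w_i$, $q_i$ of the proof:

```latex
\begin{align*}
(1+\lambda_i^2 q_i)\,u_i^2 - (\lambda_i w_i - u_i)^2
  &= u_i^2 + \lambda_i^2 u_i^2 q_i
     - \lambda_i^2 w_i^2 + 2\lambda_i u_i w_i - u_i^2 \\
  &= -\lambda_i^2 w_i^2 + 2\lambda_i u_i w_i + \lambda_i^2 u_i^2 q_i .
\end{align*}
```

Dividing by $n(1+\lambda_i^2 q_i)$ then gives the increment $\frac{1}{n}\bigl(u_i^2 - (\lambda_i w_i - u_i)^2/(1+\lambda_i^2 q_i)\bigr)$ between $Z_i(u)$ and $Z(u)$.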

Clearly, from the definition of $w_i$, $w_i \mid \{Y_{(i)}, \Lambda\} \sim \mathcal{N}(0, u'W_i\mathcal{S}_i^{-2}W_i'u/n^2)$.
Since by assumption $\|u\|_2 = \sqrt{n}$, we have

\[
0 \le u'W_i\mathcal{S}_i^{-1}W_i'u/n^2 = u'W_i(W_i'W_i)^{-1}W_i'u/n \le 1
\]

because $W_i(W_i'W_i)^{-1}W_i'$ is an orthogonal projection matrix (hence its
eigenvalues are only 0 and 1) and $\|u/\sqrt{n}\|_2 = 1$.

So we are exactly in the situation we were in during the proof of
Theorem 4.3, except for a term in $u_i^4$ that now appears in our bound on the
variance. Hence, with our extra assumption on $\|u\|_4^4/n^2$, we conclude similarly
(after a regularization step) that $Z(u)$ converges in probability, conditional
on $\Lambda$, to its conditional mean, which is simply

\[
\frac{1}{n}\sum_{i=1}^n u_i^2\, E(P(i,i)\mid\Lambda). \qquad \Box
\]
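
To illustrate the statement of Lemma C.1 in the simplest setting, the sketch below takes $\Lambda = \mathrm{Id}_n$ (an assumption made only for this example). In that case $E(P(i,i)) = p/n$ for every $i$, because $\mathrm{trace}(P) = p$ and the diagonal entries are exchangeable, so the centering term equals $p/n$ whenever $\|u\|_2^2 = n$. The sizes, the seed, the particular vector $u$ and the helper names are all hypothetical choices. Only the Python standard library is used, although NumPy would be the usual tool at larger sizes.

```python
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, m):
            f = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = aug[r][m] - sum(aug[r][c] * x[c] for c in range(r + 1, m))
        x[r] = s / aug[r][r]
    return x

def quad_form_Z(Y, u):
    """Z(u) = u' Y (Y'Y)^{-1} Y' u / n, that is, u'Pu/n with Lambda = Id_n."""
    n, p = len(Y), len(Y[0])
    gram = [[sum(Y[k][i] * Y[k][j] for k in range(n)) for j in range(p)]
            for i in range(p)]                                  # Y'Y
    v = [sum(Y[k][i] * u[k] for k in range(n)) for i in range(p)]  # Y'u
    x = solve(gram, v)                                          # (Y'Y)^{-1} Y'u
    return sum(vi * xi for vi, xi in zip(v, x)) / n

rng = random.Random(0)
n, p = 400, 40
Y = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]

# A non-constant deterministic u, rescaled so that ||u||_2^2 = n.
raw = [1.0 + (i % 3) for i in range(n)]
scale = (n / sum(r * r for r in raw)) ** 0.5
u = [scale * r for r in raw]

Z = quad_form_Z(Y, u)
target = p / n   # (1/n) sum_i u_i^2 E P(i,i), using E P(i,i) = p/n when Lambda = Id_n
```

With these sizes `Z` should fall near `target = 0.1`, with fluctuations of order $n^{-1/2}$. The lemma itself covers general diagonal $\Lambda$, for which no closed form of $E(P(i,i)\mid\Lambda)$ is assumed here.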

We remark that to get an analog of Theorem 4.5, where now

\[
\zeta = \frac{1}{n}\, u'\Lambda Y\mathcal{S}^{-1}v,
\]

one just needs to go through the proof and replace the $w_i$ appearing there
by the "new" $w_i = u'W_i\mathcal{S}_i^{-1}Y_i/n$. Exactly the same arguments go through
when $\sum_{i=1}^n u_i^2\lambda_i^2/n$ remains bounded. So under this condition, $\zeta$ tends to
zero in probability.

With the help of the previous lemma, we can now prove the gist of
Proposition 4.8.

FACT C.1. Proposition 4.8 holds.

Proof. We note that Proposition 4.8 is essentially an application of the
previous lemma, with appropriate change of notation. Recall the notations
from the proposition. We have $X = \Lambda Y\Sigma^{1/2}$ and $\Lambda$, which is $n \times n$, has
singular value decomposition $ADB'$. Also, $\widehat{\Sigma} = X'X/n$, $\tilde{Y} = B'Y$, $F = \tilde{Y}'D^2\tilde{Y}/n$.
Hence, in the language of the proposition,

\[
\hat{m}'\widehat{\Sigma}^{-1}\hat{m} = \frac{1}{n^2}\, e'AD\tilde{Y}F^{-1}\tilde{Y}'DA'e = \frac{1}{n}\,\omega'P\omega,
\]

where $P = D\tilde{Y}(\tilde{Y}'D^2\tilde{Y})^{-1}\tilde{Y}'D$ and $\omega = A'e$. When the assumptions of the
proposition are in force, $\Lambda$ is deterministic and Lemma C.1 applies; from
which we conclude

\[
\hat{m}'\widehat{\Sigma}^{-1}\hat{m} - \frac{1}{n}\sum_{i=1}^n \omega_i^2\, E(P(i,i)) \to 0 \quad\text{in probability}.
\]

This gives us the analog of Theorem 4.3.

To get the analog of Theorem 4.5, we just need $\sum_{i=1}^n \omega_i^2 d_i^2/n$ to remain
bounded, which is an assumption stated in Proposition 4.8. □

C.3. Bootstrap specific results. 

Bootstrapping mean 0 Gaussian data. Our analysis of the bootstrap
problem requires an analysis similar to the one we performed in the
previous subsection. In particular, there we have $u = \Lambda^{1/2}e$, where $\Lambda$ contains
the bootstrap weights. Since those add up to $n$, the assumption $\|u\|_2^2 = n$
was clearly satisfied. Also, in the situation where $p/n \to \rho \in (0, 1 - 1/e)$, we
are guaranteed that

\[
P^* = \Lambda^{1/2}Y(Y'\Lambda Y)^{-1}Y'\Lambda^{1/2}
\]

is well defined with high probability. When conditioning on $\Lambda$, we see that
we can work only with the submatrix $\Lambda^*$ (of size $n^*$) whose diagonal entries
are nonzero. This submatrix has its diagonal entries bounded away from 0
as they are at least equal to 1. Also, using arguments similar to those given
in the proof of Lemma B.1, we see that we can get a uniform (in $\Lambda$) lower
bound on the smallest singular value of $\Lambda Y$, which holds with probability
exponentially [in $(n^* - p)$] close to 1.

So now we assume that we are dealing with $\Lambda$ such that $n^* - p$ tends to
$\infty$, the empirical distribution of $\Lambda$ goes to Po(1) and $\sum_{i=1}^n \lambda_i^2/n^2 \to 0$. We also
assume that (Assumption-BB) is satisfied for this $\Lambda$. Finally, we assume
that $\{\sum_{i=1}^n \lambda_i^2/n \le 10\}$. We call the corresponding set of matrices $\mathcal{GB}_n$. When
the diagonal entries of $\Lambda$ are drawn from a multinomial$(1/n, n)$ it is clear
that these conditions are satisfied with probability going to 1. The only thing
that might require an explanation is why the condition $\{\sum_{i=1}^n \lambda_i^2/n \le 10\}$
holds with probability going to 1. The mean of $\sum_{i=1}^n \lambda_i^2/n$ clearly goes to 2,
using the marginal distribution of $\lambda_i$. On the other hand, the arguments we
gave in Proposition 4.10 show that its variance goes to 0, so this quantity
goes to 2 in probability and therefore is less than 10 with probability going
to 1.
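
The behavior of the bootstrap weights described here is easy to observe in a small simulation. The sketch below is only an illustration under stated assumptions: it draws the weights by throwing $n$ balls into $n$ equiprobable cells, and the sample size, seed and function name are hypothetical choices. It then reports $\sum_i \lambda_i^2/n$, whose mean is exactly $2 - 1/n$ since each $\lambda_i$ is Binomial$(n, 1/n)$, and the fraction $n^*/n$ of nonzero weights, which is close to $1 - 1/e$.

```python
import math
import random

def bootstrap_weights(n, rng):
    """Draw (lambda_1, ..., lambda_n): n draws into n equiprobable cells."""
    lam = [0] * n
    for _ in range(n):
        lam[rng.randrange(n)] += 1
    return lam

rng = random.Random(1)
n = 20000
lam = bootstrap_weights(n, rng)

total = sum(lam)                                   # equals n by construction
mean_sq = sum(w * w for w in lam) / n              # sum_i lambda_i^2 / n
frac_nonzero = sum(1 for w in lam if w > 0) / n    # n* / n
poisson_limit = 1.0 - math.exp(-1.0)               # P(Po(1) > 0)
```

For a draw of this size `mean_sq` is expected to sit near 2, far below the threshold 10 used to define the set of matrices above, and `frac_nonzero` near `poisson_limit`, which is where the restriction $\rho < 1 - 1/e$ comes from.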

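The main question that we still have to address is that of the behavior of

\[
\frac{1}{n}\sum_{i=1}^n u_i^2\, E(P^*(i,i)\mid\Lambda)
\]

when $u_i^2 = \lambda_i$. By definition,

\[
P^*(i,i) = \frac{\lambda_i}{n}\, Y_i'\left(\frac{1}{n}\sum_{j=1}^n \lambda_j Y_jY_j'\right)^{-1} Y_i = 1 - \frac{1}{1 + \lambda_i\,\frac{p}{n}\,\frac{Y_i'\mathcal{S}_i^{-1}Y_i}{p}},
\]

where $\mathcal{S}_i = \frac{1}{n}\sum_{j \ne i} \lambda_j Y_jY_j'$. Now concentration arguments (see, e.g.,
Section 5.6) show that, if $\sigma_p(\mathcal{S}_i)$ is the smallest singular value of $\mathcal{S}_i$,

\[
P\left(\left|\frac{Y_i'\mathcal{S}_i^{-1}Y_i}{p} - \frac{\mathrm{trace}(\mathcal{S}_i^{-1})}{p}\right| > t \,\Big|\, \Lambda\right) = O(\exp(-p t^2 \sigma_p^2(\mathcal{S}_i)/2)).
\]

We also know that with overwhelming probability (measured over $Y_{(i)} =
\{Y_1, \ldots, Y_{i-1}, Y_{i+1}, \ldots, Y_n\}$), $\sigma_p(\mathcal{S}_i)$ is bounded away from 0, conditionally
on $\Lambda$, when $\Lambda$ is such that (Assumption-BB) holds. (Note for instance that
$\mathcal{S}_i \succeq \sum_{j \ne i} Y_jY_j'/n$ and use Lemma B.1.) Hence, we conclude that

\[
\frac{Y_i'\mathcal{S}_i^{-1}Y_i}{p} \simeq \frac{\mathrm{trace}(\mathcal{S}_i^{-1})}{p}.
\]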
p p 

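Hence, conditionally on $\Lambda$,

\[
P^*(i,i) \simeq 1 - \frac{1}{1 + \lambda_i (p/n)(\mathrm{trace}(\mathcal{S}_i^{-1})/p)},
\]

with very high probability, that is, the probability that the difference
between the two is greater than $\lambda_i(p/n)t$ is $O(\exp(-C(n^* - p)t^2))$ for a fixed
$C$ (by arguments similar to those given in Lemma B.1). In other respects,
we note that rank-1 perturbation arguments give, if $\mathcal{S} = \frac{1}{n}Y'\Lambda Y$,

\[
\mathrm{trace}(\mathcal{S}_i^{-1}) - \mathrm{trace}(\mathcal{S}^{-1}) = \frac{\lambda_i}{n}\,\frac{Y_i'\mathcal{S}_i^{-2}Y_i}{1 + \frac{\lambda_i}{n}\,Y_i'\mathcal{S}_i^{-1}Y_i}.
\]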
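
Both rank-one facts used in this passage are exact algebraic identities (Sherman-Morrison), so they can be checked on a tiny example. The sketch below assumes the normalizations $\mathcal{S} = \frac{1}{n}\sum_j \lambda_j Y_jY_j'$ and $\mathcal{S}_i = \mathcal{S} - \frac{\lambda_i}{n}Y_iY_i'$; the dimensions, the seed, the weight vector and the helper names are illustrative assumptions. It compares $P^*(i,i)$ computed directly with its leave-one-out form $1 - 1/(1 + \frac{\lambda_i}{n}Y_i'\mathcal{S}_i^{-1}Y_i)$, and the trace gap with $\frac{\lambda_i}{n}Y_i'\mathcal{S}_i^{-2}Y_i/(1 + \frac{\lambda_i}{n}Y_i'\mathcal{S}_i^{-1}Y_i)$.

```python
import random

def inverse(a):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    m = len(a)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(m)]
           for i, row in enumerate(a)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        d = aug[col][col]
        aug[col] = [x / d for x in aug[col]]
        for r in range(m):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[m:] for row in aug]

def quad(y, mat, z):
    """Return y' M z."""
    return sum(y[a] * mat[a][b] * z[b] for a in range(len(y)) for b in range(len(z)))

def trace(mat):
    return sum(mat[a][a] for a in range(len(mat)))

rng = random.Random(2)
n, p, i = 12, 3, 0
Y = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
lam = [2, 1, 1, 0, 1, 3, 0, 1, 1, 0, 1, 1]   # illustrative weights summing to n

def weighted_gram(skip=None):
    """(1/n) sum_j lam_j Y_j Y_j', optionally leaving out index `skip`."""
    return [[sum(lam[j] * Y[j][a] * Y[j][b] for j in range(n) if j != skip) / n
             for b in range(p)] for a in range(p)]

S_inv = inverse(weighted_gram())             # S^{-1}
Si_inv = inverse(weighted_gram(skip=i))      # S_i^{-1}
Si_inv2 = [[sum(Si_inv[a][k] * Si_inv[k][b] for k in range(p)) for b in range(p)]
           for a in range(p)]                # S_i^{-2}

c = lam[i] / n
q = quad(Y[i], Si_inv, Y[i])                 # Y_i' S_i^{-1} Y_i

P_ii_direct = c * quad(Y[i], S_inv, Y[i])    # (lam_i/n) Y_i' S^{-1} Y_i
P_ii_loo = 1.0 - 1.0 / (1.0 + c * q)         # leave-one-out form

trace_gap = trace(Si_inv) - trace(S_inv)
trace_formula = c * quad(Y[i], Si_inv2, Y[i]) / (1.0 + c * q)
```

The pairs `P_ii_direct`, `P_ii_loo` and `trace_gap`, `trace_formula` should agree up to rounding. The probabilistic content of the argument lies only in replacing $Y_i'\mathcal{S}_i^{-1}Y_i/p$ by $\mathrm{trace}(\mathcal{S}_i^{-1})/p$.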

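In particular, when $\Lambda$ is such that (Assumption-BB) holds, by using a union
bound argument,

\[
P\left(\max_{i=1,\ldots,n}\left|\frac{\mathrm{trace}(\mathcal{S}_i^{-1})}{p} - \frac{\mathrm{trace}(\mathcal{S}^{-1})}{p}\right| > \varepsilon \,\Big|\, \Lambda\right) \to 0.
\]

We also note that $\mathrm{trace}(\mathcal{S}^{-1})/p \to s$ conditionally on $\Lambda$, if $\Lambda$ is such that its
empirical distribution goes to Po(1), $\Lambda \in \mathcal{GB}_n$ and $p/n \to \rho$.

Therefore, we also have by a simple union bound argument, conditional
on $\Lambda$, and assuming that $\Lambda$ is such that its empirical distribution goes to
Po(1), $\Lambda \in \mathcal{GB}_n$ and hence $\sum_{i=1}^n \lambda_i^2/n$ is less than 10,

\[
\frac{1}{n}\sum_{i=1}^n \lambda_i P^*(i,i) \simeq 1 - \frac{1}{n}\sum_{i=1}^n \frac{\lambda_i}{1 + \lambda_i\rho_n s}.
\]

Now when $\Lambda \Longrightarrow$ Po(1), which we write $G$, and $\rho_n \to \rho$,

\[
\frac{1}{n}\sum_{i=1}^n \frac{\lambda_i}{1 + \lambda_i\rho_n s} \to \int \frac{\tau\, dG(\tau)}{1 + \tau\rho s}.
\]

But in light of the Marčenko-Pastur equation, we have, under these
circumstances,

\[
\frac{1}{n}\sum_{i=1}^n \lambda_i P^*(i,i) \to 1 - \frac{1}{s} = \frac{s-1}{s}.
\]

We finally conclude that conditional on $\Lambda$ being in $\mathcal{GB}_n$ (whose probability
goes to 1),

\[
(\hat{\mu}^*)'(\widehat{\Sigma}^*)^{-1}\hat{\mu}^* \to \frac{(s-1)/s}{1 - (s-1)/s} = s - 1 \ge \frac{\rho}{1-\rho},
\]

since we know that $s \ge 1/(1-\rho)$ when $G$ is Po(1), since its mean is 1.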
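
The size of the gap between $s - 1$ and $\rho/(1-\rho)$ can be explored numerically. The sketch below rests on an explicit assumption: equation (4) is not reproduced in this appendix, so the code takes the Marčenko-Pastur relation at 0 in the form $\int \tau\, dG(\tau)/(1+\tau\rho s) = 1/s$, which reduces to $s = 1/(1-\rho)$ when $G$ is a point mass at 1 and is consistent with the limit $1 - 1/s$ above. The value $\rho = 0.3$, the truncation of the Poisson sum and the function names are likewise illustrative choices.

```python
import math

def poisson1_pmf(k):
    """P(T = k) for T ~ Poisson(1)."""
    return math.exp(-1.0) / math.factorial(k)

def mp_residual(s, rho, kmax=60):
    """s * int tau dG(tau) / (1 + tau*rho*s) - 1 for G = Po(1) (assumed form)."""
    return sum(poisson1_pmf(k) * k * s / (1.0 + k * rho * s)
               for k in range(1, kmax)) - 1.0

def solve_s(rho, hi=1e6, iters=200):
    """Bisection for the positive root; the residual is increasing in s."""
    lo = 0.0
    if mp_residual(hi, rho) <= 0.0:
        raise ValueError("no root found: rho must be below 1 - 1/e")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mp_residual(mid, rho) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rho = 0.3
s = solve_s(rho)
gaussian_limit = rho / (1.0 - rho)   # ratio appearing without bootstrapping
bootstrap_limit = s - 1.0            # limit of the bootstrapped quantity
```

Under the assumed form, the inequality $s \ge 1/(1-\rho)$ follows from Jensen's inequality, since $\tau \mapsto \tau/(1+\tau\rho s)$ is concave and $G$ has mean 1. The residual tends to $(1 - 1/e)/\rho - 1$ as $s \to \infty$, so a root exists only when $\rho < 1 - 1/e$, matching the range of $\rho$ considered for the bootstrap.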

Arguments similar to the ones used in the proofs in the main body of the
paper show that the same convergence in probability result holds
unconditionally on $\Lambda$, the problem being to get bounds that are uniform in $\Lambda$, when
$\Lambda \in \mathcal{GB}_n$.

Hence, an analog of Theorem 4.3 follows (with $P_n$ probability going to 1),
where the ratio $\rho/(1-\rho)$ is replaced by $s - 1$. The analog of Theorem 4.5
follows from the arguments given in Appendix C.2, if we can show, in the
notation used there, that $\sum_{i=1}^n (u_i d_i)^2/n$ remains bounded with probability
going to 1. Note that $u_i = d_i = \sqrt{\lambda_i}$ here, where $\lambda_i$ are the bootstrap weights,
so we just need to show that $\sum_{i=1}^n \lambda_i^2/n$ remains bounded. But we did this
when describing $\mathcal{GB}_n$.

We therefore have an analog of Theorem 4.5 and also of Theorem 4.6
when bootstrapping Gaussian data.

Bootstrapping elliptically distributed data. Finally, let us say a few words
about what would happen if we replaced the normality assumption for the
$X_i$'s by an elliptical distribution assumption. We focus on the case where
$X_i = \lambda_i\Sigma^{1/2}Y_i$, that is, the mean of the $X_i$'s is 0. The previous analyses make
clear that the key questions concern $v'(\widehat{\Sigma}^*)^{-1}v$ and $(\hat{\mu}^*)'(\widehat{\Sigma}^*)^{-1}\hat{\mu}^*$.

The questions concerning $v'(\widehat{\Sigma}^*)^{-1}v$ fall pretty much directly under the
study we have made of elliptical distributions, since we know, according to
the proof of Theorem 4.12, that

\[
\widehat{\Sigma}^* = \frac{1}{n-1}\,\Sigma^{1/2}Y'\Lambda D^{1/2}H_\delta D^{1/2}\Lambda Y\Sigma^{1/2},
\]

where $D$ is the diagonal matrix containing the bootstrap weights and $\delta =
D^{1/2}e$. So, as long as $D^{1/2}\Lambda$ satisfies (Assumption-BB), results similar to
Theorem 4.12 will hold.

The questions dealing with $(\hat{\mu}^*)'(\widehat{\Sigma}^*)^{-1}\hat{\mu}^*$ are more involved. Analyses
similar to the ones performed above show that the key quantity to
understand is now

\[
\frac{1}{n}\, e'D\Lambda Y(Y'\Lambda'D\Lambda Y)^{-1}Y'\Lambda'De = \frac{1}{n}\, u'P_{D^{1/2}\Lambda,Y}\,u,
\]

where $P_{D^{1/2}\Lambda,Y} = D^{1/2}\Lambda Y(Y'\Lambda'D\Lambda Y)^{-1}Y'\Lambda D^{1/2}$ and $u = D^{1/2}e$. The
analysis of this quadratic form can be carried out just like we did above in the
Gaussian case, that is, $\Lambda = \mathrm{Id}_n$. However, the remarks we made to get
simplified expressions for the limit do not seem to apply anymore: quantities of
the type

\[
\frac{1}{n}\sum_{i=1}^n \frac{d_i}{1 + \lambda_i^2 d_i\rho s}
\]

appear, where $s$ is the solution of equation (4) with $G$ being the limit (if
it exists) of the empirical distribution of the random variables $\lambda_i^2 d_i$. These
quantities do not appear to simplify any further to yield a clearer and more
exploitable expression.